COMPREHENSIVE GENERAL LUMI COURSE WARSAW, POLAND

## AMD HARDWARE AND SOFTWARE

SUYASH TANDON, JUSTIN CHANG, JULIO MAIA, NOEL CHALMERS, PAUL BAUMAN, NICHOLAS CURTIS, NICHOLAS MALAYA, ALESSANDRO FANFARILLO, JOSE NOUDOHOUENOU, CHIP FREITAG, DAMON MCDOUGALL, NOAH WOLFE, SAMUEL ANTAO, GEORGE MARKOMANOLIS, BOB ROBEY, GINA SITARAMAN

**JAKUB KURZAK - PRESENTER** 

ADVANCED MICRO DEVICES, INC.

AMD together we advance\_ slides on LUMI in /project/project\_465000644/Slides/AMD/ hands-on exercises: <u>https://hackmd.io/@sfantao/H1QU6xRR3</u> hands-on source code: /project/project\_465000644/Exercises/AMD/HPCTrainingExamples/

# AMD HARDWARE FOR HPC AND AI CDNA ARCHITECTURE



### AMD GPUS





Radeon<sup>™</sup> Graphics Cards RDNA architecture E.g.:

- o RX 6000 Series
- $\circ$  RX 7000 Series

AMD Instinct<sup>™</sup> Accelerators
CDNA architecture
E.g.:
MI100
MI200
MI300



### AMD IN HPC





#### Frontier@ORNL

- currently the largest machine in the world
- the first computer to cross 1 exaFLOPS
- o AMD EPYC CPUs
- AMD Instinct GPUs

#### LUMI@CSC

- currently the largest machine in Europe
- $\circ$  3<sup>rd</sup> fastest in the world
- o AMD EPYC CPUs
- o AMD Instinct GPUs

### AMD INSTINCT<sup>™</sup> MI200



## AMD INSTINCT<sup>™</sup> MI250X WORLD'S MOST ADVANCED DATA CENTER ACCELERATOR



https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf

AMD together we advance\_

### AMD INSTINCT<sup>™</sup> MI200



### 2ND GENERATION CDNA ARCHITECTURE TAILORED-BUILT FOR HPC & AI



## **MULTI-CHIP DESIGN**

#### TWO GPU DIES IN PACKAGE TO MAXIMIZE COMPUTE & DATA THROUGHPUT





## 2<sup>nd</sup> GENERATION MATRIX CORES

OPTIMIZED COMPUTE UNITS FOR SCIENTIFIC COMPUTING



DOUBLE PRECISON (FP64) MATRIX CORE THROUGHPUT REPRESENTATION

### AMD INSTINCT<sup>™</sup> MI200

|                                                                            | нвм2                                  |                                                                                                                                  | нвм2       |                                                                |                                                                | HBM2       |                               |                                                                                                                      | НВМ2       |                                                                                                        |
|----------------------------------------------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|------------|----------------------------------------------------------------|----------------------------------------------------------------|------------|-------------------------------|----------------------------------------------------------------------------------------------------------------------|------------|--------------------------------------------------------------------------------------------------------|
| Infinity Fabric Link                                                       | Memory Phy                            | Memory Memory<br>Controller Controller                                                                                           | Memory Phy | VCN                                                            | VCN                                                            | Memory Phy | Memory<br>Controller          | Memory<br>Controller                                                                                                 | Memory Phy |                                                                                                        |
| bic Link Infinity Fabric Link Infinity Fabric Link Infinity Fabric or PCIe | A A A A A A A A A A A A A A A A A A A | Compute Engine Controllers Compute Engine Compute Engine Compute Engine Compute Engine CO CU |            | Infinity Fabric Link Infinity Fabric Link Infinity Fabric Link | Infinity Fabric Link Infinity Fabric Link Infinity Fabric Link |            | Compute Engine Compute Engine | Compute Engine Controllers |            | Infinity Fabric or PCIe Infinity Fabric Link Infinity Fabric Link Infinity Fabric Link Infinity Fabric |
|                                                                            | Memory Phy                            | Memory Memory<br>Controller Controller                                                                                           | Memory Phy | VCN                                                            | VCN                                                            | Memory Phy | Memory<br>Controller          | Memory<br>Controller                                                                                                 | Memory Phy | ric Link                                                                                               |
|                                                                            | НВМ2                                  |                                                                                                                                  | НВМ2       |                                                                |                                                                | НВМ2       |                               |                                                                                                                      | НВМ2       |                                                                                                        |
|                                                                            |                                       |                                                                                                                                  | INFIN      | IITY @                                                         | 🏷 FAB                                                          | RIC        |                               |                                                                                                                      |            |                                                                                                        |

### MI200 COMPUTE UNIT



#### each SIMD unit

- $\circ$  has 16 SIMD lanes
- operates on vectors (waves) of size 64
- handles up to 10 waves simultaneously

### AMD INSTINCT<sup>™</sup> MI300



The world's first integrated data center CPU + GPU

AMD INSTINCT™ MI300

Breakthrough architecture to power the exascale AI era



## **UNIFIED MEMORY APU ARCHITECTURE BENEFITS**

#### AMD CDNA<sup>™</sup> 2 Coherent Memory Architecture

- Simplifies
   Programming
- Low Overhead 3<sup>rd</sup> Gen Infinity Interconnect
- Industry Standard Modular Design



### AMD CDNA<sup>™</sup> 3 Unified Memory APU Architecture

- Eliminates Redundant Memory Copies
- High-Efficiency 4<sup>th</sup> Gen AMD Infinity Architecture
- Low TCO with Unified Memory APU Package



# AMD SOFTWARE FOR HPC AND AI ROCM PLATFORM



## AMD ROCm<sup>™</sup> Open Software Platform For GPU Compute



## **Open Software Platform For GPU Compute**

### AMDA ROCm

- Unlocked GPU Power To
   Accelerate Computational Tasks
- Optimized for HPC and Deep
   Learning Workloads at Scale
- Open Source Enabling Innovation,
   Differentiation, and Collaboration

| Benchmarks & App Support                               | Optimized Training/Inference Models & Applications |        |             |                      |                   |      |          |            |  |
|--------------------------------------------------------|----------------------------------------------------|--------|-------------|----------------------|-------------------|------|----------|------------|--|
|                                                        | MLPERF                                             | HF     | PL/HPCG     | Life S               | cience Geo Scienc |      | ence     | ce Physics |  |
| Operating Systems Support                              | RHEL                                               |        | CentOS      |                      | SLES              |      |          | Ubuntu     |  |
| Cluster Deployment                                     | Singulari                                          | ty     | Kubernetes® |                      | Docker®           |      | SLURM    |            |  |
| Framework Support                                      | ork Support Kokkos/RA                              |        | PyTor       |                      | brch              |      | Ten      | TensorFlow |  |
| Libraries                                              | BLAS                                               | RANI   |             | FFT                  | MIGrap            |      | IVisionX | PRIM       |  |
|                                                        | SOLVER                                             | ALUTIO | NC          | SPARSE               | THRUS             | ST N | 110pen   | RCCL       |  |
| Programming Models                                     | OpenMP <sup>®</sup> API                            |        | Ope         | OpenCL™ HIP API      |                   |      |          |            |  |
| Development Toolchain                                  | Compiler                                           | Profil | er          | Tracer               | Debug             | ger  | hipify   | GPUFort    |  |
| Drivers & Runtime GPU Device Drivers and ROCm Run-Time |                                                    |        |             |                      |                   |      |          |            |  |
| Deployment Tools                                       | ROCm Validation St                                 |        | ite         | ROCm Data Center Too |                   | ool  | ROCm SMI |            |  |

together we advance\_

# AMDA ROCM 5.0 Democratizing exascale for all

| EXPANDING                                    | OPTIMIZING                                     | ENABLING                                        |  |  |
|----------------------------------------------|------------------------------------------------|-------------------------------------------------|--|--|
| SUPPORT & ACCESS                             | PERFORMANCE                                    | DEVELOPER SUCCESS                               |  |  |
| <ul> <li>Support for Radeon Pro</li></ul>    | <ul> <li>MI200 Optimizations: FP64</li></ul>   | <ul> <li>HPC Apps &amp; ML Frameworks</li></ul> |  |  |
| W6800 Workstation GPUs                       | Matrix ops, Improved Cache                     | on AMD InfinityHub                              |  |  |
| <ul> <li>Remote access through the</li></ul> | <ul> <li>Improved launch latency and</li></ul> | <ul> <li>Streamlined and improved</li></ul>     |  |  |
| AMD Accelerator Cloud                        | kernel performance                             | tools increasing productivity                   |  |  |

#### Public

### LIBRARIES

#### rocBLAS / hipBLAS

• basic operations on dense matrices

#### rocSOLVER

 $\circ$  dense linear algebra solvers

### rocSPARSE / hipSPARSE

 $\circ$   $\,$  basic operations on sparse matrices

### rocALUTION

 $\circ$  sparse linear algebra solvers

### rocFFT / hipFFT

• Fast Fourier transforms

### rocRAND / hipRAND

 $\circ$  random number generation

#### rocPRIM / hipCUB / rocThrust

 $\circ$  scan, sort, reduction, etc.

https://github.com/ROCmSoftwarePlatform/rocBLAS https://github.com/ROCmSoftwarePlatform/hipBLAS

https://github.com/ROCmSoftwarePlatform/rocSOLVER

https://github.com/ROCmSoftwarePlatform/rocSPARSE https://github.com/ROCmSoftwarePlatform/hipSPARSE

https://github.com/ROCmSoftwarePlatform/rocALUTION

https://github.com/ROCmSoftwarePlatform/rocFFT https://github.com/ROCmSoftwarePlatform/hipFFT

https://github.com/ROCmSoftwarePlatform/rocRAND https://github.com/ROCmSoftwarePlatform/hipRAND

https://github.com/ROCmSoftwarePlatform/rocPRIM https://github.com/ROCmSoftwarePlatform/hipCUB https://github.com/ROCmSoftwarePlatform/rocThrust

### **ALSO OPEN SOURCE**

#### the compiler

o <u>https://github.com/ROCmSoftwarePlatform/llvm-project</u>

#### the runtime

<u>https://github.com/RadeonOpenCompute/ROCR-Runtime</u>

#### the debugger

o <u>https://github.com/ROCm-Developer-Tools/ROCgdb</u>

### the profiler

o <u>https://github.com/ROCm-Developer-Tools/rocprofiler</u>

### the HPL benchmark

o <u>https://github.com/ROCmSoftwarePlatform/rocHPL</u>

### the HPCG benchmark

o <u>https://github.com/ROCmSoftwarePlatform/rocHPCG</u>

etc.

# AMD SOFTWARE FOR HPC AND AI HIP PROGRAMMING



### **GPU ACCELERATION** HOST AND DEVICE

#### the host is the CPU

- host code runs here
- usual C++ syntax and features
- entry point is the "main" function
- $\circ~$  use the HIP API to
  - create device buffers
  - $\circ$  moved data between host and device
  - launch device code

#### the device is the GPU

- $\circ$  device code runs here
- C/C++ syntax and features
- o device code is launched as "kernels"
- o instructions from the host are sent to streams

together we advance\_



### FUNCTION QUALIFIERS

HOST AND DEVICE

#### \_\_global\_\_

o **"kernels"** 

- $\circ$   $\,$  execute the GPU  $\,$
- $\circ$   $\,$  can be called from the CPU  $\,$

#### \_\_device\_\_

- $\circ$  execute the GPU
- can be called from device code (kernels or a \_\_device\_\_ functions)

#### \_\_host\_\_ \_\_device\_\_

- $\circ$   $\,$  executes on the CPU when called from CPU code
- $\circ$   $\,$  executes on the GPU when called from GPU code

# HIP KERNEL LANGUAGE

#### in 2D

- $\circ$  each colored box is a block
- o each block has an index blockIdx.[xyz]
- $\circ$   $\,$  each small square is a thread
- $\circ$  each thread has a 2D index threadIdx.  $[\,xyz\,]$
- o grid dimensions in blockDim. [xyz]





### HIP KERNEL LANGUAGE GPU CODE

- all local variables and arrays are thread-private
- threads can exchange data through shared memory (LDS)
- o declare using the \_\_shared\_\_ keyword
- o use \_\_syncthreads() to synchronize





# HIP KERNEL LANGUAGE

#### saxpy loop

- $\circ$  two 1D arrays
- $\circ$  they[i] += a\*x[i] operation
- mapped to 1D grid of threads/blocks
- $\circ$  each thread takes on index

### HIP API MEMORY MANAGEMENT

| hipError_t hip | Malloc (void **ptr, size_t size)                                                                                                                                                                        |
|----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                |                                                                                                                                                                                                         |
| hipError_t hip | oFree (void *ptr)                                                                                                                                                                                       |
| hipI           | e memory allocated by the hcc hip memory allocation API. This API performs an implicit<br>DeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is<br>urned. More |
| hipError_t hip | <pre>Memcpy (void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)</pre>                                                                                                                    |
| Сор            | by data from src to dst. More                                                                                                                                                                           |

- GPU operates on GPU memory
- $\circ$   $\,$  need to allocate GPU memory  $\,$
- need to copy data between the CPU memory and the GPU memory

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group\_\_\_\_memory.html

### HIP API ERROR HANDLING

- $\circ$  check last error
- o get error name
- $\circ$  get error string

| hipError_t   | hipGetLastError (void)                                                                                         |
|--------------|----------------------------------------------------------------------------------------------------------------|
|              | Return last error returned by any HIP runtime API call and resets the stored error code to hipSuccess.<br>More |
| hipError_t   | hipPeekAtLastError (void)                                                                                      |
|              | Return last error returned by any HIP runtime API call. More                                                   |
| const char * | hipGetErrorName (hipError_t hip_error)                                                                         |
|              | Return hip error as text string form. More                                                                     |
| const char * | hipGetErrorString (hipError_t hipError)                                                                        |
|              | Return handy text string message to explain the error which occurred. More                                     |

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group\_\_\_error.html

### HIP API DEVICE MANAGEMENT

- $\circ$  check number of devices
- $\circ$  switch devices
- $\circ$  synchronize devices

| hipError_t | hipDeviceSynchronize (void)                                                       |
|------------|-----------------------------------------------------------------------------------|
|            | Waits on all active streams on current device. More                               |
| hipError_t | hipDeviceReset (void)                                                             |
|            | The state of current device is discarded and updated to a fresh state. More       |
| hipError_t | hipSetDevice (int deviceId)                                                       |
|            | Set default device to be used for subsequent hip API calls from this thread. More |
| hipError_t | hipGetDevice (int *deviceId)                                                      |
|            | Return the default device id for the calling host thread. More                    |
| hipError_t | hipGetDeviceCount (int *count)                                                    |
|            | Return number of compute-capable devices. More                                    |

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group\_\_\_\_device.html

### HIP API STREAM MANAGEMENT

- o create stream
- o destroy stream
- o synchronize stream

| hipError_t | hipStreamCreate (hipStream_t <b>*stream</b> )     |
|------------|---------------------------------------------------|
|            | Create an asynchronous stream. More               |
| hipError_t | hipStreamDestroy (hipStream_t stream)             |
|            | Destroys the specified stream. More               |
| hipError_t | hipStreamSynchronize (hipStream_t stream)         |
|            | Wait for all commands in stream to complete. More |

- $\circ$  etc.
- $\circ$  etc.
- $\circ$  etc.

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group\_stream.html



### AMD LINGO





| 1<br>2             | #include <cuda.h></cuda.h>                                                                  |
|--------------------|---------------------------------------------------------------------------------------------|
| -<br>3<br>4        | constant <i>float</i> a = 2.0 <i>f</i> ;                                                    |
| 5<br>6<br>7        | global<br><i>void</i> saxpy( <i>int</i> n, <i>float</i> const* x, <i>float</i> * y)<br>{    |
| 8<br>9<br>10<br>11 | <pre>int i = blockDim.x*blockIdx.x + threadIdx.x; if (i &lt; n)     y[i] += a*x[i]; }</pre> |

- $\circ$  vector addition kernel in CUDA
- each thread takes one array index
- $\circ$  and performs one multiply-and-add operation

### SIMPLE SAXPY KERNEL

```
ADDING THE CPU CODE
    #include <cuda.h>
    \_constant_{float} a = 2.0f;
    __global__
    void saxpy(int n, float const* x, float* y)
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] += a*x[i];
    int main()
14
        int n = 256;
        std::size_t size = sizeof(float)*n;
        float* d_x;
        float* d_y;
        cudaMalloc(&d_x, size);
                                                                      allocate arrays in device memory
        cudaMalloc(&d_y, size);
        int num_blocks = 2;
                                                                      set up the grid
        int num_threads = 128;
24
                                                                      launch the kernel
        saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y); <-</pre>
        cudaDeviceSynchronize();
```



```
ADDING HOST↔ DEVICE COPIES
    #include <cuda.h>
    \_constant_{float} a = 2.0f;
    __global__
    void saxpy(int n, float const* x, float* y)
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] += a*x[i];
    int main()
14
        int n = 256;
        std::size_t size = sizeof(float)*n;
        float* h_x = (float*)malloc(size);
                                                                     allocate arrays in host memory
        float* h_y = (float*)malloc(size);
        float* d_x;
        float* d_y;
        cudaMalloc(&d_x, size);
        cudaMalloc(&d_y, size);
24
        cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
                                                                     copy content to device memory
        cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
        int num_blocks = 2;
        int num_threads = 128;
        saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
                                                                     copy results back to host memory
        cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
        cudaDeviceSynchronize();
```

AMD together we advance\_

```
Public]
```

```
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a*x[i];
int main()
   std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   float* d_x;
   float* d_y;
   cudaMalloc(&d_x, size);
   cudaMalloc(&d_y, size);
   cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
   int num_blocks = 2;
   int num_threads = 128;
   saxpy <<< num_blocks, num_threads>>>> (n, d_x, d_y);
   cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
   cudaDeviceSynchronize();
   cudaFree(d_x);
                                                                   free arrays in device memory
   cudaFree(d_y);
    free(h_x);
                                                                   free arrays in host memory
    free(h_y);
```

### **ADDING MEMORY CLEANUP**

```
#include <cassert>
\__constant__ float a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess) 
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

### **ADDING ERROR CHECKS**

simple error checking macro

#### simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

## simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
   saxpy <<< num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
   CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

### same code in HIP

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a * x[i];
#define CHECK(call) assert(call == hipSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   assert(h_x != nullptr);
   assert(h_y != nullptr);
    float* d_x;
    float* d_y;
   CHECK(hipMalloc(&d_x, size));
   CHECK(hipMalloc(&d_y, size));
    CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
   CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy < < <num_blocks, num_threads>>> (n, d_x, d_y);
   CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
    CHECK(hipDeviceSynchronize());
    CHECK(hipFree(d_x));
   CHECK(hipFree(d_y));
    free(h_x);
    free(h_y);
```

# spot the differences

## simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
   saxpy <<< num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
   CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

### same code in HIP

```
#include <hip/hip_runtime.h>
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a * x[i];
#define CHECK(call) assert(call == hipSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   assert(h_x != nullptr);
   assert(h_y != nullptr);
    float* d_x;
    float* d_y;
   CHECK(hipMalloc(&d_x, size));
   CHECK(hipMalloc(&d_y, size));
   CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
    CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy < < <num_blocks, num_threads>>> (n, d_x, d_y);
   CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
   CHECK(hipDeviceSynchronize());
   CHECK(hipFree(d_x));
   CHECK(hipFree(d_y));
    free(h_x);
    free(h_y);
```

# **HIPIFY TOOLS**

## hipify-clang

- compiler (clang) based translator
- handles very complex constructs
- $\circ$  prints an error if not able to translate
- $\circ$  supports clang options
- requires CUDA

## hipify-perl

- Perl<sup>®</sup> script
- $\circ$  relies on regular expressions
- $\circ$  may struggle with complex constructs
- $\circ$  does not require CUDA

## https://github.com/ROCm-Developer-Tools/HIPIFY

```
Public]
```

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global_
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy <<< num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

saxpy\$ perl /opt/rocm/bin/hipify-perl -examin saxpy.cu

```
[HIPIFY] info: file 'saxpy.cu' statisitics:
  CONVERTED refs count: 13
  TOTAL lines of code: 46
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
    cuda.h => hip/hip_runtime.h: 1
    cudaDeviceSynchronize => hipDeviceSynchronize: 1
    cudaFree => hipFree: 2
    cudaMalloc => hipMalloc: 2
    cudaMemcpy => hipMemcpy: 3
    cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
    cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2
    cudaSuccess => hipSuccess: 1
saxpy$
```

# hipify-perl

## hipify-perl -examin

- o for initial assessment
- no replacements done
- prints basic statistics and the number of replacements

```
Public]
```

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global_
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy < < <num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
   CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

saxpy\$ perl /opt/rocm/bin/hipify-perl saxpy.cu
#include "hip/hip\_runtime.h"
#include <hip/hip\_runtime.h>
#include <cassert>

\_\_constant\_\_ float a = 2.0f;

```
__global__
void saxpy(int n, float const* x, float* y)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
}</pre>
```

#define CHECK(call) assert(call == hipSuccess)

```
int main()
{
    int n = 256;
    std::size_t size = sizeof(float)*n;
```

```
float* h_x = (float*)malloc(size);
float* h_y = (float*)malloc(size);
assert(h_x != nullptr);
assert(h_y != nullptr);
```

```
float* d_x;
float* d_y;
CHECK(hipMalloc(&d_x, size));
CHECK(hipMalloc(&d_y, size));
```

CHECK(hipMemcpy(d\_x, h\_x, size, hipMemcpyHostToDevice)); CHECK(hipMemcpy(d\_y, h\_y, size, hipMemcpyHostToDevice));

int num\_blocks = 2; int num\_threads = 128; saxpy<<<<num\_blocks, num\_threads>>>(n, d\_x, d\_y);

CHECK(hipMemcpy(h\_y, d\_y, size, hipMemcpyDeviceToHost)); CHECK(hipDeviceSynchronize());

CHECK(hipFree(d\_x)); CHECK(hipFree(d\_y));

free(h\_x); free(h\_y);

saxpy\$

# hipify-perl

translating a file to standard output

#### but can also

- o translate in place
- preserve orig copy
- recursively do folders

```
AMD
together we advance_
```

```
Public]
```

```
#include <cassert>
#include "cuda2hip.h" <
__constant__ float a = 2.0f;
__global__
    int i = blockDim.x*blockIdx.x + threadIdx.x;
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy << < num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
     free(h_x);
    free(h_y);
```

|   | #define | cudaSuccess            | hipSuccess            |
|---|---------|------------------------|-----------------------|
| 2 | #define | cudaMalloc             | hipMalloc             |
| 3 | #define | cudaMemcpy             | hipMemcpy             |
|   | #define | cudaMemcpyHostToDevice | hipMemcpyHostToDevice |
| 5 | #define | cudaMemcpyDeviceToHost | hipMemcpyDeviceToHost |
| 6 | #define | cudaDeviceSynchronize  | hipDeviceSynchronize  |
| 7 | #define | cudaFree               | hipFree               |
| 8 |         |                        |                       |

## alternatively

- create a file with renaming macros
- include conditionally, depending on target

# **OPTIMIZATION TECHNIQUES**

## basic

- thread divergence / SIMDzation
- reuse in shared memory & bank conflicts
- coalescing of global memory accesses
- resource partitioning / occupancy / spills
- $\circ~$  L1, L2 cache blocking

0 ...

## advanced

- $\circ$  atomics
- $\circ$  warp primitives
- CPU-GPU coherence
- inter-stream synchronization
- 0 ...

```
#include <cassert>
#include <cstdlib>
#include <cstdio>
int main()
    int n = 256;
    std::size_t size = sizeof(float)*n;
    float* x = (float*)malloc(size);
    float* y = (float*)malloc(size);
    assert(x != nullptr);
    assert(y != nullptr);
    for (int i = 0; i < n; ++i)
        y[i] += a*x[i];
    free(x);
    free(y);
```

## alternatively

 $\circ$  just write CPU code

| 2<br>3<br>4                 | include <cassert><br/>include <cstdlib><br/>include <cstdio><br/>include <omp.h></omp.h></cstdio></cstdlib></cassert>                            |
|-----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| 5<br>6<br>7                 | const float a = 2.0f;                                                                                                                            |
| 8<br>9                      | nt main()                                                                                                                                        |
| 10<br>11<br>12              | <pre>int n = 256;<br/>std::size_t size = sizeof(float)*n;</pre>                                                                                  |
| 13<br>14<br>15<br>16<br>17  | <pre>float* x = (float*)malloc(size); float* y = (float*)malloc(size); assert(x != nullptr); assert(y != nullptr);</pre>                         |
| 18<br>19<br>20<br>21        | <pre>#pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n]) for (int i = 0; i &lt; n; ++i)     y[i] += a*x[i];</pre> |
| 22<br>23<br>24<br><b>25</b> | free(x);<br>free(y);                                                                                                                             |

## alternatively

- $\circ$  just write CPU code
- use OpenMP<sup>®</sup> target offload constructs

# **KOKKOS AND RAJA**

- portability frameworks based on C++
- portability to CPUs & GPUs AMD, Intel<sup>®</sup>, NVIDIA
- basic parallel processing constructs
- o multidimensional arrays
- etc., etc., etc.

## Kokkos

- $\,\circ\,\,$  originates from Sandia National Laboratory
- o https://kokkos.org/
- o <u>https://github.com/kokkos</u>

## RAJA

- $\circ$  originates from Lawrence Livermore
- o <u>https://raja.readthedocs.io</u>
- o <u>https://github.com/LLNL/RAJA</u>

#### Public]

# **DIFFERENCES FROM CUDA**

- o warpSize
  - $\circ$  64 on AMD
  - $\circ$  32 on NVIDIA
- o dynamic parallelism not supported
- $\circ$  exercise caution:
  - $\circ$  atomics
  - managed memory
  - warp-level primitives
  - $\circ$  inter-process communication

# AMD RESOURCES DOCUMENTATION AND TRAINING



## AMD ROCM DEVELOPER HUB

## Engage with ROCm Experts

Participate in ROCm Webinar Series Post questions, view FAQ's in Community Forum

## Increase Understanding

Purchase ROCm Text Book View the latest news in the Blogs

## Get Started Using ROCm

ROCm Documentation on GitHub Download the Latest Version of ROCm

https://www.amd.com/en/developer/rocm-hub.html

Intro to ROCm Start Using ROCm Community Infinity Hub Blogs

AMD ROCm<sup>™</sup> is an open software platform for accelerated compute offering a code once, use everywhere approach. Access all ROCm developer resources here — from documentation, to training webinars, to the latest blogs, and more.

#### Get to Know ROCm™



 Training Webinars

 Register for an upcoming ROCm training webinar or view previous webinars on-demand.

 Register Now [2]



ROCm Textbook Learn about using ROCm with the "Accelerated Computing with HIP" textbook. Get Now [2]



ROCm Docs Find the latest documentation on all ROCm releases.

See Docs 🖸





#### Ask the ROCm Community

This community is for ROCm users to come together to learn, share experiences, and help solve issues using the ROCm platform.

Learn More 🗳



## **NEW ROCM DOCS**

## Comprehensive Coverage

Compilers and Frameworks Math libraries, communication libraries Management tools, validation tools

Howto Guides

Installation Tunning

Debugging

https://rocm.docs.amd.com/

| GitHub Community                                                                                                                                                                                | AMD Lab Notes                                                                                                                                                                 | Infinity Hub                                                      | Support | Feedback                                                                                              |                                                                  |     |   |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|---------|-------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|-----|---|
| ROCm Documentation<br>Home<br>What is ROCm?<br>Deploy ROCm<br>Linux Quick Start                                                                                                                 | AMD RO Applies to Linux (2) What is ROCm?                                                                                                                                     | Cm™ D                                                             |         | ntation                                                                                               | Release Info                                                     |     | ۵ |
| Linux Overview 🗸 🗸                                                                                                                                                                              |                                                                                                                                                                               | ~                                                                 |         | ~                                                                                                     |                                                                  | Ť   |   |
| Release Info   Release Notes   Changelog   GPU Support and OS Compatibility   (Linux)   Known Issues rate   Compatibility   Licensing Terms   APIs and Reference   All Reference Material   HIP | APIs and Reference<br>Compilers and<br>HIP<br>OpenMP<br>Math Libraries<br>C++ Primitives<br>Communicatio<br>AI Libraries<br>Computer Visio<br>Management T<br>Validation Tool | Development Tools<br>: Libraries<br>n Libraries<br>on<br>fools    |         | Understand ROC<br>• Compiler Dis<br>• Using CMak<br>• Linux Folder<br>• GPU Isolatic<br>• GPU Archite | sambiguation<br>e<br>: Structure Reorganization<br>on Techniques |     |   |
| Math LibrariesC++ Primitive LibrariesCommunication LibrariesAl LibrariesComputer VisionOpenMPCompilers and ToolsManagement ToolsValidation Tools                                                | <ul> <li>GPU Aware MI</li> <li>Setting up for I</li> <li>Magma In</li> <li>PyTorch Ir</li> </ul>                                                                              | Deep Learning with<br>stallation<br>nstallation<br>w Installation |         |                                                                                                       | ·                                                                |     |   |
| Understand ROCm<br>All Explanation Material<br>Compiler Disambiguation                                                                                                                          |                                                                                                                                                                               |                                                                   |         |                                                                                                       | What is                                                          | Nex | 2 |

together we advance\_

AMDZ ROCm<sup>™</sup> Platform 5.6.0

## HIP TEXTBOOK

## Comprehensive Coverage

HIP Language AMD GPU Internals Performance Analysis Debugging Programming Patterns ROCm Libraries Porting to HIP Multi-GPU Programming Third Party Tools CDNA Assembly



https://www.barnesandnoble.com/w/accelerated-computing-with-hip-yifan-sun/1142866934

## **ISA REFERENCE GUIDE**

## Public ISA

The Instruction Set Architecture is public There is no intermediate layer like PTX You can write assembly code You can compile to assembly for inspection

# 

#### "AMD Instinct MI200" Instruction Set Architecture Reference Guide

4-February-2022

AMD together we advance\_

https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/instinct-mi200-cdna2-instruction-set-architecture.pdf

## **AMD INFINITY HUB**

## AMD Instinct<sup>™</sup> MI200 SUPPORT

29 key applications & frameworks on Infinity Hub & a catalogue supporting over 90 applications, frameworks & tools

## Accelerating Instinct<sup>™</sup> adoption

Over 17000 application pulls. 10000+ since last year

## PERFORMANCE RESULTS

Published Performance Results for Select Apps / Benchmarks

https://www.amd.com/en/technologies/infinity-hub

#### AMD Infinity Hub



AMD together we advance\_

## SOFTWARE CATALOG

STRONG MOMENTUM AND INCREASING LIST OF SUPORTED APPLICATION, LIBRARIES & FRAMEWORKS

| Life Science                                             | Physics                                                                         | Chemistry                                             | CFD                                                                                              | Earth Science                                                                                      |
|----------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| AMBER<br>GROMACS<br>NAMD<br>LAMMPS<br>Hoomd-Blue<br>VASP | MILC<br>GRID<br>QUANTUM ESPRESSO<br>N-Body<br>CHROMA<br>PIConGPU<br>QuickSilver | CP2K<br>QUDA<br>NWCHEM<br>TERACHEM<br>QMCPACK         | OpenFOAM®<br>AMR-WIND<br>NEKBONE<br>LAGHOS<br>NEKO<br>NEKRS<br>PeleC                             | EXAGO<br>DEVITO<br>OCCA<br>SPECFEM3D-GLOBE<br>SPECFEM3D-CARTESIAN<br>ACECAST (WRF)<br>MPAS<br>ICON |
| Benchmarks                                               | Libraries                                                                       | ML Frameworks                                         | ISV Applications                                                                                 |                                                                                                    |
| HPL<br>HPCG<br>AMG<br>ML - TORCHBENCH<br>ML - SUPERBENCH | AMR-EX<br>Ginkko<br>HYPRE<br>TRILINOS                                           | PYTORCH<br>TENSORFLOW<br>JAX<br>ONNX<br>OPENAI TRITON | ANSYS MECHANICAL<br>CADENCE CHARLES<br>ANSYS FLUENT*<br>SIEMENS® STAR-CCM+*<br>SIEMENS® CALIBRE* | + MANY MORE                                                                                        |

## AMD LAB NOTES



Search GPUOpen...

Q

知乎 🕒 🎽 🚇 斎

HOME SOFT

SOFTWARE 🗸 🛛 DOCS 🗸

 Image: Control of the second secon

Home » Blogs » AMD lab notes » AMD matrix cores



• Finite difference method - Laplacian part 1

• Finite difference method - Laplacian part 2

• Finite difference method - Laplacian part 3

Finite difference method - Laplacian part 4

Example 1 - V\_MFMA\_F32\_16x16x4F32 Example 2 - V\_MFMA\_F32\_16x16x1F32 Example 3 - V\_MFMA\_F64\_4x4x4F64 A note on rocWMMA

Introduction to profiling tools for AMD

AMD Instinct<sup>™</sup> MI200 GPU memory space

Gegister pressure in AMD CDNA™2 GPUs
 GPU-aware MPI with ROCm

A note on the AMD Matrix Instruction Calculator ...

AMD lab notes

AMD matrix cores

References

overview

Search this manual

Using AMD matrix cores MFMA compiler intrinsic syntax

#### AMD matrix cores 🖉

Matrix multiplication is a fundamental aspect of Linear Algebra and it is an ubiquitous computation within High Performance Computing (HPC) Applications. Since the introduction of AMD's CDNA Architecture, Generalized Matrix Multiplication (GEMM) computations are now hardware-accelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS but they can also be programmed directly by developers. Applications that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.

AMD's Matrix Core technology supports a full range of mixed precision operations bringing us the ability to work with large models and enhance memory-bound operation performance for any combination of AI and machine learning workloads. The various numerical formats have uses in different applications. Examples include use of 8-bit integers (INT8) for ML inference, 32-bit floating point (FP32) data for ML Training and HPC applications, 16-bit floating point (FP16) data for graphics workloads and 16-bit brain float (BF16) data for ML training with fewer convergence issues.

To learn more about the theoretical speedups achievable by using matrix cores compared to SIMD Vector Units, please refer to the tables below. The tables list the performance of the Vector (i.e. Fused Multiply-Add or FMA) and Matrix core units of the previous generation (MI100) and current generation (MI250X) of CDNA Accelerators.

Matrix Core Performance for MI100 and MI250X:

| Data format | MI100 Flops/Clock/CU | MI250X Flops/Clock/CU |
|-------------|----------------------|-----------------------|
| FP64        | N/A                  | 256                   |
| FP32        | 256                  | 256                   |
| FP16        | 1024                 | 1024                  |
| BF16        | 512                  | 1024                  |
| INT8        | 1024                 | 1024                  |

Vector (FMA) Unit Performance for MI100 and MI250X:

| Data format | MI100 Flops/Clock/CU | MI250X Flops/Clock/CU |  |
|-------------|----------------------|-----------------------|--|
| FP64        | 64                   | 128                   |  |
| FP32        | 128                  | 128                   |  |

## **Introductory Topics**

ROCm installation Basics of HIP programming

## Advanced Topics

Matrix Cores Register pressure GPU-aware MPI

https://gpuopen.com/learn/amd-lab-notes/ https://github.com/AMD/amd-lab-notes

## **OLCF TRAINING**

## Tutorials, Workshops, Hackathons

Slides available online Recordings available online

## **User Guides**

Frontier User Guide Crusher User Guide

https://www.olcf.ornl.gov/for-users/training/



## TRAINING

Find upcoming training events and archives of training materials detailing how to use OLCF resources and services.

#### GETTING STARTED

#### HOME / FOR USERS / TRAINING

SYSTEM USER GUIDES

Welcome to the OLCF Training page. Please follow the links below to find upcoming training events as well as archives of training materials.

#### TRAINING

#### OLCF POLICY GUIDE

DOCUMENTS & FORMS

CENTER STATUS



#### Training Calendar

Find upcoming and past training events presented either on-site or via webcast by the OLCF.

#### Contact Support

MYOLCF

Need assistance from a trained OLCF support staff member? We're

- here to help.
- Submit a Support Ticket
- Call: 865.241.6536
- ✓ Email: help@olcf.ornl.gov
- ♥ Status Tweets: @olcfstatus



Find both instructional and hands-on training materials meant to demonstrate the use of specific technologies. This includes online guides, recordings of past training events, and screencasts.







# **OLCF TRAINING ARCHIVE**

## Frontier Training

GPU Profiling GPU Debugging Node Performance Engineering Programming Models for AMD GPUs

# Produced by

AMD staff HPE staff ORNL staff

|          | A OLCF User Documentation                                                                                                 |
|----------|---------------------------------------------------------------------------------------------------------------------------|
| (        | Search docs                                                                                                               |
|          |                                                                                                                           |
|          | New User Quick Start                                                                                                      |
|          | Accounts and Projects                                                                                                     |
|          | Connecting                                                                                                                |
|          | Systems                                                                                                                   |
|          | Services and Applications                                                                                                 |
|          | Data Storage and Transfers                                                                                                |
|          | Software                                                                                                                  |
| _        |                                                                                                                           |
| Ξ        | Training                                                                                                                  |
|          | Training<br>OLCF Training Calendar 🖸                                                                                      |
|          |                                                                                                                           |
|          | OLCF Training Calendar 🗗                                                                                                  |
|          | OLCF Training Calendar 亿<br>OLCF Tutorials 亿                                                                              |
|          | OLCF Training Calendar<br>OLCF Tutorials<br>OLCF Training Archive                                                         |
| <b>₽</b> | OLCF Training Calendar<br>OLCF Tutorials<br>OLCF Training Archive<br>OLCF GPU Hackathons                                  |
| +        | OLCF Training Calendar<br>OLCF Tutorials<br>OLCF Training Archive<br>OLCF GPU Hackathons<br>OLCF Vimeo Channel            |
| +        | OLCF Training Calendar<br>OLCF Tutorials<br>OLCF Training Archive<br>OLCF GPU Hackathons<br>OLCF Vimeo Channel<br>Quantum |

**CAK RIDGE** National Laboratory

| 2023-02-17 | Checkpointing Tips                            | Scott Atchley, HPC Systems<br>Engineer, Distinguished R&D<br>Staff, ORNL                                | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides $\mathbb{C} \mid recording^{\mathbb{C}}$ )                                            |
|------------|-----------------------------------------------|---------------------------------------------------------------------------------------------------------|----------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| 2023-02-17 | Frontier Tips & Tricks                        | Balint Joo, Group Leader,<br>Advanced Computing for<br>Nuclear, Particles, &<br>Astrophysics, ORNL      | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides $\mathbb{C} \mid recording^{\mathbb{C}}$ )                                            |
| 2023-02-17 | GPU Debugging                                 | Mark Stock, HPC Applications<br>Engineer, HPE                                                           | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>ℤ</sup>   recording <sup>ℤ</sup> )                                               |
| 2023-02-17 | GPU Profiling                                 | Alessandro Fanfarillo, Senior<br>Member of Technical Staff,<br>Exascale Application<br>Performance, AMD | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup><math>\mathcal{C}</math></sup>   recording <sup><math>\mathcal{C}</math></sup> ) |
| 2023-02-17 | Application Profiling                         | Trey White, Master Engineer,<br>HPE                                                                     | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-16 | Orion Lustre and Best<br>Practices            | Jesse Hanley, Senior HPC<br>Linux Systems Engineer, ORNL                                                | Frontier Training Workshop <sup>™</sup>                  | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-16 | Node Performance                              | Tom Papatheodore, HPC<br>Engineer, ORNL                                                                 | Frontier Training Workshop <sup>™</sup>                  | (slides <sup>@</sup>   recording <sup>@</sup> )                                               |
| 2023-02-16 | NVMe Usage                                    | Chris Zimmer, Group Leader,<br>Technology Integration, ORNL                                             | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-16 | AI on Frontier                                | Junqi Yin, Computational<br>Scientist, ORNL                                                             | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-16 | Python on Frontier                            | Michael Sandoval, HPC<br>Engineer, ORNL                                                                 | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>ℤ</sup>   recording <sup>ℤ</sup> )                                               |
| 2023-02-16 | HPE Cray MPI                                  | Tim Mattox, HPC Performance<br>Engineer, HPE                                                            | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>ℤ</sup>   recording <sup>ℤ</sup> )                                               |
| 2023-02-16 | GPU Programming Models                        | GPU Programming Models                                                                                  | Frontier Training Workshop <sup>™</sup>                  | (slides <sup>™</sup>   recording <sup>™</sup> )                                               |
| 2023-02-15 | Slurm on Frontier                             | Tom Papatheodore, HPC<br>Engineer, ORNL                                                                 | Frontier Training Workshop <sup>♂</sup>                  | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-15 | Storage Areas and Data<br>Transfers           | Suzanne Parete-Koon, HPC<br>Engineer, ORNL                                                              | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>ℤ</sup>   recording <sup>ℤ</sup> )                                               |
| 2023-02-15 | Using the Frontier<br>Programming Environment | Matt Belhorn, HPC Engineer,<br>ORNL                                                                     | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>♂</sup>   recording <sup>♂</sup> )                                               |
| 2023-02-15 | Frontier Programming<br>Environment           | Wael Elwasif, Computer<br>Scientist, ORNL                                                               | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>C*</sup>   recording <sup>C*</sup> )                                             |
| 2023-02-15 | Epyc CPU and Instinct GPU                     | Nick Malaya, Principal<br>Member of Technical Staff,<br>Exascale Application<br>Performance, AMD        | Frontier Training Workshop $^{\ensuremath{\varnothing}}$ | (slides <sup><math>\mathbb{C}</math></sup>   recording <sup><math>\mathbb{C}</math></sup> )   |
| 2023-02-15 | Frontier Architecture<br>Overview             | Joe Glenski, Sr. Distinguished<br>Technologist, HPE                                                     | Frontier Training Workshop $^{\ensuremath{\mathcal{C}}}$ | (slides <sup>♂</sup>   recording <sup>♂</sup> )                                               |
| 2023-02-15 | Welcome to the Frontier<br>Workshop           | Ashley Barker, Section Head,<br>Operations, National Center<br>for Computational Sciences,<br>ORNL      | Frontier Training Workshop <sup>⊘</sup>                  | (slides $^{\mathscr{C}}$   recording $^{\mathscr{C}}$ )                                       |

# **OLCF PROGRAMMING GUIDES**

## Frontier User Guide Crusher Quick-Start Guide

**GPU** architecture

Node architecture

Programming environment (HIP, OpenMP<sup>®</sup>) Profiling

Debugging

**CAK RIDGE** tional Laboratory FACILITY ☆ OLCF User Documentation Search docs New User Ouick Start Accounts and Projects Connecting Systems Frontier User Guide System Overview Connecting Data and Storage 🗄 Using Globus to Move Data to Orion AMD GPUs Programming Environment Compiling Running Jobs Software Debugging Profiling Applications Tips and Tricks System Updates H Known Issues E Summit User Guide 🕀 Citadel User Guide Andes User Guide ⊞ Home Data Transfer Nodes (DTNs) High Performance Storage System Ascent Spock Quick-Start Guide E Crusher Quick-Start Guide Services and Applications Data Storage and Transfers Software Training

#### Note

#### TERMINOLOGY:

The 8 GCDs contained in the 4 MI250X will show as 8 separate GPUs according to Slurm, ROCR\_VISIBLE\_DEVICES, and the ROCR runtime, so from this point forward in the quick-start guide, we will simply refer to the GCDs as GPUs.



#### Note

There are [4x] NUMA domains per node and [2x] L3 cache regions per NUMA for a total of [8x] L3 cache regions. The 8 GPUs are each associated with one of the L3 regions as follows:

#### NUMA 0:

- hardware threads 000-007, 064-071 | GPU 4
- hardware threads 008-015, 072-079 | GPU 5

#### NUMA 1:

- hardware threads 016-023, 080-087 | GPU 2
- hardware threads 024-031, 088-095 | GPU 3

#### NUMA 2:

https://docs.olcf.ornl.gov/systems/frontier\_user\_guide.html https://docs.olcf.ornl.gov/systems/crusher\_guick\_start\_guide.html



## **ENCCS AMD TRAINING VIDEOS**

HIP programming OpenMP® offload Developing in Fortran GPU-aware MPI Roofline modeling Profiling Debugging ML frameworks

https://enccs.github.io/amd-rocm-development/

Developing Applications with the AMD ROCm Ecosystem



#### THE LESSON

Search docs

Introduction to HIP Programming

Porting Applications to HIP

Getting Started with OpenMP® Offload Applications on AMD Accelerators

Developing Fortran Applications: HIPFort, OpenMP®, and OpenACC

Exercises

Architecture

GPU-Aware MPI with ROCmTM

AMD Node Memory Model

Hierarchical Roofline on AMD InstinctTM MI200 GPUs

Affinity — Placement, Ordering and Binding

Profiling and debugging

OpenMP Offload Programming Introduction to ML Frameworks

Summary and outlook

REFERENCE

Quick Reference

Instructor's guide

/ Developing Applications with the AMD ROCm Ecosystem

#### Developing Applications with the AMD ROCm Ecosystem



Developing Applications with the AMD ROCm Ecosystem

29 November - 2 December 2022

This training material is created by AMD in collaboration with ENCCS. It covers how to develop and port applications to run on AMD GPU and CPU hardware on top AMD-powered supercomputers. You will learn about the ROCm software development languages, libraries, and tools, as well as getting a developer's view of the hardware that powers the system. The material focuses mostly on how to program applications to run on the GPU.

#### O Prerequisites

It is useful to have prior experience developing HPC applications, and some understanding of recent HPC computer hardware and the Linux operating system.

#### The lesson

- Introduction to HIP Programming
- Porting Applications to HIP
- Getting Started with OpenMP® Offload Applications on AMD Accelerators
- Developing Fortran Applications: HIPFort, OpenMP®, and OpenACC
- Exercises
- Architecture
- GPU-Aware MPI with ROCmTM
- AMD Node Memory Model
- Hierarchical Roofline on AMD InstinctTM MI200 GPUs
- Affinity Placement, Ordering and Binding
- Profiling and debugging
- OpenMP Offload Programming
- Introduction to ML Frameworks
- Summary and outlook

60

## **PAWSEY AMD TRAINING VIDEOS**

Introduction rocprof Introduction to omniprof Introduction to omnitrace **Roofline modeling** 



#### **AMD** Profiling

**Pawsey Supercomputing Research Centre** 6 videos 33 views Last updated on May 4, 2023

=+ A

> Play all 🔀 Shuffle

The AMD profiling workshop covers the AMD suite of tools for development of HPC applications on AMD GPUs.

You will learn how to use the rocprof profiler and trace visualization tool that has long been available as part of the ROCm software suite.

You will also learn how to use the new Omnitools, Omnitrace and Omniperf that were introduced at the end of 2022. Omnitrace is a powerful tracing profiler for both CPU and GPU. It can collect data from a much wider range of sources and includes hardware counters and sampling approaches. Omniperf is a performance analysis tool that can help you pinpoint how your application is performing with a visual view of the memory hierarchy on the GPU as well as reporting the percentage of peak for many different measurements.

| nd - AMD Profile | <br>and and and |
|------------------|-----------------|
| le               | l               |
| -                | 1               |

ntroduction to ROCm Profiler -AMD Profiling workshop - Day 1- Pt1 awsey Supercomputing Research Centre • 59 views • 2 months ago

Q

0

Introduction to OmniTrace - AMD Profiling workshop - Day 1 - Pt2 Pawsey Supercomputing Research Centre • 84 views • 2 months ago



Hands on workshop - AMD Profiling workshop - Day 1- Pt3 Pawsey Supercomputing Research Centre • 15 views • 2 months ago



Introduction to Omniperf - AMD Profiling workshop - Day 2- Pt1 Pawsey Supercomputing Research Centre • 34 views • 2 months ago



# DISCLAIMERS

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

© 2023 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD CDNA, AMD ROCm, AMD Instinct, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

# **ATTRIBUTIONS**

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Git and the Git logo are either registered trademarks or trademarks of Software Freedom Conservancy, Inc., corporate home of the Git Project, in the United States and/or other countries.

Intel is a trademark of Intel Corporation or its subsidiaries.

Kubernetes is a registered trademark of The Linux Foundation.

NAMD was developed by the Theoretical Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/namd/

OpenCL is a trademark of Apple Inc. used by permission by Khronos Group, Inc.

OpenFOAM is a registered trademark of OpenCFD Limited, producer and distributor of the OpenFOAM software via www.openfoam.com.

The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.

Perl is a trademark of Perl Foundation.

Siemens is a registered trademark of Siemens Product Lifecycle Management Software Inc., or its subsidiaries or affiliates, in the United States and in other countries.

#