COMPREHENSIVE GENERAL LUMI COURSE 24/4/2024

# AMD HARDWARE AND SOFTWARE

Suyash Tandon, Justin Chang, Julio Maia, Noel Chalmers, Paul T. Bauman, Nicholas Curtis, Nicholas Malaya, Alessandro Fanfarillo, Jose Noudohouenou, Chip Freitag, Damon McDougall, Noah Wolfe, Jakub Kurzak, Samuel Antao, <u>George Markomanolis</u>, Bob Robey

ADVANCED MICRO DEVICES, INC.

AMD together we advance\_ slides on LUMI in /project/project\_465001098/Slides/AMD/

hands-on exercises: <u>https://hackmd.io/@gmarkoma/lumi\_finland</u>

hands-on source code: /project/project\_465001098/Exercises/AMD/HPCTrainingExamples/

# AMD HARDWARE FOR HPC AND AI CDNA ARCHITECTURE



# AMD GPUS





Radeon<sup>™</sup> Graphics Cards RDNA architecture E.g.:

- **RX 6000 Series**
- o RX 7000 Series

AMD Instinct<sup>™</sup> Accelerators
CDNA architecture
E.g.:
MI100
MI200
MI300

# AMD IN HPC





#### Frontier@ORNL

- currently the largest machine in the world
- the first computer to cross 1 exaFLOPS
- o AMD EPYC CPUs
- AMD Instinct GPUs

#### LUMI@CSC

- currently the largest machine in Europe
- $\circ$  5<sup>th</sup> fastest in the world
- o AMD EPYC CPUs
- o AMD Instinct GPUs

# AMD INSTINCT<sup>™</sup> MI200



# AMD INSTINCT<sup>M</sup> MI250X ONE OF THE WORLD'S MOST ADVANCED DATA CENTER ACCELERATOR



https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf

AMD together we advance\_

# AMD INSTINCT<sup>™</sup> MI200



# 2ND GENERATION CDNA ARCHITECTURE TAILORED-BUILT FOR HPC & AI



7

# **MULTI-CHIP DESIGN**

#### TWO GPU DIES IN PACKAGE TO MAXIMIZE COMPUTE & DATA THROUGHPUT



# 2<sup>nd</sup> GENERATION MATRIX CORES

OPTIMIZED COMPUTE UNITS FOR SCIENTIFIC COMPUTING



REPRESENTATION





# **AMD MI250X specifications**

- Two graphic compute dies (GCDs)
- 64GB of HBM2e memory per GCD (total 128GB)
- 26.5 TFLOPS peak performance per GCD
- 1.6 TB/s peak memory bandwidth per GCD
- 110 CU per GCD, total 220 CU per GPU
- The interconnection is attached to the GPU (not on the CPU)
- Both GCDs are interconnected with 200 GB/s per direction
- 128 single precision FMA operations per cycle
- AMD CDNA 2 Matrix Core supports double-precision data
- Memory coherency

AMD CDNA<sup>™</sup> 2 white paper: https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf

# NEW IN AMD INSTINCT MI250X PACKED FP32

FP64 PATH USED TO EXECUTE TWO COMPONENT VECTOR INSTRUCTIONS ON FP32

#### DOUBLES FP32 THROUGHPUT PER CLOCK PER COMPUTE UNIT

pk\_FMA, pk\_ADD, pk\_MUL, pk\_MOV operations



https://www.amd.com/en/technologies/infinity-hub/mini-hacc

together we advance\_

# MI200 COMPUTE UNIT



#### each SIMD unit

- $\circ$  has 16 SIMD lanes
- o operates on vectors (waves) of size 64
- handles up to 10 waves simultaneously

# AMD SOFTWARE FOR HPC AND AI ROCM PLATFORM



# AMD ROCm<sup>™</sup> Open Software Platform For GPU Compute





# **Open Software Platform For GPU Compute**

# AMDA ROCm

- Unlocked GPU Power To
   Accelerate Computational Tasks
- Optimized for HPC and Deep
   Learning Workloads at Scale
- Open Source Enabling Innovation,
   Differentiation, and Collaboration

| Ponchmarks & Ann Sunnart  | Optimized Training/Inference Models & Applications |                 |                 |             |                    |        |                     |              |  |
|---------------------------|----------------------------------------------------|-----------------|-----------------|-------------|--------------------|--------|---------------------|--------------|--|
| Benchmarks & App Support  | MLPERF                                             | HPL/F           | HPL/HPCG Life S |             | Science Geo Scienc |        | Science             | ce Physics   |  |
| Operating Systems Support | RHEL                                               |                 | CentC           | )S          |                    | SLES   |                     | Ubuntu       |  |
| Cluster Deployment        | Singulari                                          | ty              | Kuberne         | tes®        | Do                 | ocker® |                     | SLURM        |  |
| Framework Support         | Kokkos/RAJA                                        |                 |                 | PyTorch     |                    |        | TensorFlow          |              |  |
| Libraries                 | BLAS<br>SOLVER                                     | RAND<br>ALUTION |                 | FFT<br>ARSE | MIGrap<br>THRU     | _      | MIVisionX<br>MIOpen | PRIM<br>RCCL |  |
| Programming Models        | OpenMP® API                                        |                 |                 | OpenCL™     |                    |        | ΗΙΡ ΑΡΙ             |              |  |
| Development Toolchain     | Compiler                                           | Profiler        | Т               | racer       | Debug              | ger    | hipify              | GPUFort      |  |
| Drivers & Runtime         | GPU Device Drivers and ROCm Run-Time               |                 |                 |             |                    |        |                     |              |  |
| Deployment Tools          | ROCm Valio                                         | dation Suite    | RO              | Cm Data     | Center T           | ool    | ROO                 | Cm SMI       |  |

AMDL

together we advance\_

# AMDA ROCM 5.0 Democratizing exascale for all

| EXPANDING                                    | OPTIMIZING                                     | ENABLING                                        |
|----------------------------------------------|------------------------------------------------|-------------------------------------------------|
| SUPPORT & ACCESS                             | PERFORMANCE                                    | DEVELOPER SUCCESS                               |
| <ul> <li>Support for Radeon Pro</li></ul>    | <ul> <li>MI200 Optimizations: FP64</li></ul>   | <ul> <li>HPC Apps &amp; ML Frameworks</li></ul> |
| W6800 Workstation GPUs                       | Matrix ops, Improved Cache                     | on AMD InfinityHub                              |
| <ul> <li>Remote access through the</li></ul> | <ul> <li>Improved launch latency and</li></ul> | <ul> <li>Streamlined and improved</li></ul>     |
| AMD Accelerator Cloud                        | kernel performance                             | tools increasing productivity                   |

#### Public

# LIBRARIES

#### rocBLAS / hipBLAS

• basic operations on dense matrices

#### rocSOLVER

 $\circ$  dense linear algebra solvers

#### rocSPARSE / hipSPARSE

 $\circ$   $\,$  basic operations on sparse matrices

#### rocALUTION

 $\circ~$  sparse linear algebra solvers

#### rocFFT / hipFFT

Fast Fourier transforms

#### rocRAND / hipRAND

o random number generation

#### rocPRIM / hipCUB / rocThrust

 $\circ$  scan, sort, reduction, etc.

https://github.com/ROCmSoftwarePlatform/rocBLAS https://github.com/ROCmSoftwarePlatform/hipBLAS

https://github.com/ROCmSoftwarePlatform/rocSOLVER

https://github.com/ROCmSoftwarePlatform/rocSPARSE https://github.com/ROCmSoftwarePlatform/hipSPARSE

https://github.com/ROCmSoftwarePlatform/rocALUTION

https://github.com/ROCmSoftwarePlatform/rocFFT https://github.com/ROCmSoftwarePlatform/hipFFT

https://github.com/ROCmSoftwarePlatform/rocRAND https://github.com/ROCmSoftwarePlatform/hipRAND

https://github.com/ROCmSoftwarePlatform/rocPRIM https://github.com/ROCmSoftwarePlatform/hipCUB https://github.com/ROCmSoftwarePlatform/rocThrust

# **ALSO OPEN SOURCE**

#### the compiler

<u>https://github.com/ROCmSoftwarePlatform/llvm-project</u>

#### the runtime

o <u>https://github.com/RadeonOpenCompute/ROCR-Runtime</u>

#### the debugger

o <u>https://github.com/ROCm-Developer-Tools/ROCgdb</u>

#### the profiler

o <u>https://github.com/ROCm-Developer-Tools/rocprofiler</u>

#### the HPL benchmark

o <u>https://github.com/ROCmSoftwarePlatform/rocHPL</u>

#### the HPCG benchmark

<u>https://github.com/ROCmSoftwarePlatform/rocHPCG</u>

etc.

# AMD SOFTWARE FOR HPC AND AI HIP PROGRAMMING



## **GPU ACCELERATION** HOST AND DEVICE

#### the host is the CPU

- host code runs here
- usual C++ syntax and features
- entry point is the "main" function
- $\circ$   $\,$  use the HIP API to  $\,$ 
  - $\circ$  create device buffers
  - $\circ$  moved data between host and device
  - $\circ$  launch device code

#### the device is the GPU

- $\circ$  device code runs here
- C/C++ syntax and features
- o device code is launched as "kernels"
- o instructions from the host are sent to streams

together we advance\_



### FUNCTION QUALIFIERS HOST AND DEVICE

#### \_\_global\_\_\_

o "kernels"

- $\circ$  execute the GPU
- $\circ$   $\,$  can be called from the CPU  $\,$

#### \_\_device\_\_\_

- $\circ$   $\,$  execute the GPU  $\,$
- can be called from device code (kernels or a \_\_\_\_\_device\_\_\_\_ functions)

#### \_\_host\_\_\_device\_\_

- $\circ$   $\,$  executes on the CPU when called from CPU code
- $\circ$   $\,$  executes on the GPU when called from GPU code

### HIP KERNEL LANGUAGE GPU CODE

#### in 2D

- $\circ$  each colored box is a block
- each block has an index blockIdx.[xyz]
- o each small square is a thread
- each thread has a 2D index threadIdx.[xyz]
- grid dimensions in blockDim.[xyz]





### HIP KERNEL LANGUAGE GPU CODE

- all local variables and arrays are thread-private
- threads can exchange data through shared memory (LDS)
- declare using the \_\_shared\_\_ keyword
- use \_\_syncthreads() to synchronize





### HIP KERNEL LANGUAGE GPU CODE

#### saxpy loop

- $\circ$  two 1D arrays
- $\circ$  the y[i] += a\*x[i] operation
- mapped to 1D grid of threads/blocks
- $\circ$  each thread takes on index

```
1 #include <cuda.h>
2
3 ___constant__ float a = 2.0f;
4
5 __global__
6 void saxpy(int n, float const* x, float* y)
7 {
8 int i = blockDim.x*blockIdx.x + threadIdx.x;
9 if (i < n)
10 y[i] += a*x[i];
11 }</pre>
```

### HIP API MEMORY MANAGEMENT

| hipError_t | hipMalloc (void **ptr, size_t size)                                                                                                                                                                        |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|            |                                                                                                                                                                                                            |
| hipError_t | hipFree (void *ptr)                                                                                                                                                                                        |
|            | Free memory allocated by the hcc hip memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned. More |
| hipError_t | <pre>hipMemcpy (void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)</pre>                                                                                                                    |
|            | Copy data from src to dst. More                                                                                                                                                                            |

- GPU operates on GPU memory
- need to allocate GPU memory
- $\circ$   $\,$  need to copy data between the CPU memory and the GPU memory

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group memory.html

### HIP API ERROR HANDLING

- $\circ$  check last error
- o get error name
- $\circ$  get error string

| hipError_t   | hipGetLastError (void)                                                                                         |
|--------------|----------------------------------------------------------------------------------------------------------------|
|              | Return last error returned by any HIP runtime API call and resets the stored error code to hipSuccess.<br>More |
| hipError_t   | hipPeekAtLastError (void)                                                                                      |
|              | Return last error returned by any HIP runtime API call. More                                                   |
| const char * | hipGetErrorName (hipError_t hip_error)                                                                         |
|              | Return hip error as text string form. More                                                                     |
| const char * | hipGetErrorString (hipError_t hipError)                                                                        |
|              | Return handy text string message to explain the error which occurred. More                                     |



### HIP API DEVICE MANAGEMENT

- $\circ$  check number of devices
- $\circ$  switch devices
- o synchronize devices

| hipError_t | hipDeviceSynchronize (void)                                                       |
|------------|-----------------------------------------------------------------------------------|
|            | Waits on all active streams on current device. More                               |
| hipError_t | hipDeviceReset (void)                                                             |
|            | The state of current device is discarded and updated to a fresh state. More       |
| hipError_t | hipSetDevice (int deviceId)                                                       |
|            | Set default device to be used for subsequent hip API calls from this thread. More |
| hipError_t | hipGetDevice (int *deviceId)                                                      |
|            | Return the default device id for the calling host thread. More                    |
| hipError_t | hipGetDeviceCount (int *count)                                                    |
|            | Return number of compute-capable devices. More                                    |

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group device.html

### HIP API STREAM MANAGEMENT

- $\circ$  create stream
- o destroy stream
- $\circ$  synchronize stream

| hipError_t | hipStreamCreate (hipStream_t *stream)             |
|------------|---------------------------------------------------|
|            | Create an asynchronous stream. More               |
| hipError_t | hipStreamDestroy (hipStream_t stream)             |
|            | Destroys the specified stream. More               |
| hipError_t | hipStreamSynchronize (hipStream_t stream)         |
|            | Wait for all commands in stream to complete. More |

- $\circ$  etc.
- $\circ$  etc.
- $\circ$  etc.

https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group stream.html

Suppose we have 4 small kernels to execute:

hipLaunchKernelGGL(myKernel1, dim3(1), dim3(256), 0, 0, 256, d\_a1); hipLaunchKernelGGL(myKernel2, dim3(1), dim3(256), 0, 0, 256, d\_a2); hipLaunchKernelGGL(myKernel3, dim3(1), dim3(256), 0, 0, 256, d\_a3); hipLaunchKernelGGL(myKernel4, dim3(1), dim3(256), 0, 0, 256, d\_a4);

• Even though these kernels use only one block each, they'll execute in serial on the NULL stream:



• With streams we can effectively share the GPU's compute resources:

| <pre>hipLaunchKernelGGL(myKernel1,</pre> | dim3(1), | dim3(256), | 0, | stream1, | <b>256</b> , | d_a1); |
|------------------------------------------|----------|------------|----|----------|--------------|--------|
| <pre>hipLaunchKernelGGL(myKernel2,</pre> | dim3(1), | dim3(256), | 0, | stream2, | <b>256</b> , | d_a2); |
| <pre>hipLaunchKernelGGL(myKernel3,</pre> | dim3(1), | dim3(256), | 0, | stream3, | <b>256</b> , | d_a3); |
| <pre>hipLaunchKernelGGL(myKernel4,</pre> | dim3(1), | dim3(256), | 0, | stream4, | 256,         | d_a4); |

| NULL<br>Stream |           |  |
|----------------|-----------|--|
| Stream1        | myKernel1 |  |
| Stream2        | myKernel2 |  |
| Stream3        | myKernel3 |  |
| Stream4        | myKernel4 |  |

Note 1: Kernels must modify different parts of memory to avoid data races. Note 2: With large kernels, overlapping computations may not help performance.

- There is another use for streams besides concurrent kernels:
  - Overlapping kernels with data movement.
- AMD GPUs have separate engines for:
  - Host->Device memcpys
  - Device->Host memcpys
  - Compute kernels.
- These three different operations can overlap without dividing the GPU's resources.
  - The overlapping operations should be in separate, non-NULL, streams.
  - The host memory should be **pinned**.

Suppose we have 3 kernels which require moving data to and from the device:

hipMemcpy(d\_a1, h\_a1, Nbytes, hipMemcpyHostToDevice)); hipMemcpy(d\_a2, h\_a2, Nbytes, hipMemcpyHostToDevice)); hipMemcpy(d\_a3, h\_a3, Nbytes, hipMemcpyHostToDevice));

hipLaunchKernelGGL(myKernel1, blocks, threads, 0, 0, N, d\_a1); hipLaunchKernelGGL(myKernel2, blocks, threads, 0, 0, N, d\_a2); hipLaunchKernelGGL(myKernel3, blocks, threads, 0, 0, N, d\_a3);

hipMemcpy(h\_a1, d\_a1, Nbytes, hipMemcpyDeviceToHost); hipMemcpy(h\_a2, d\_a2, Nbytes, hipMemcpyDeviceToHost); hipMemcpy(h\_a3, d\_a3, Nbytes, hipMemcpyDeviceToHost);

| NULL Stream HToD1 HToD2 HToD3 1 2 3 DToH1 DToH2 DT | NULL Stream HToD1 HToD2 HToD3 myKernel myKernel myKernel DToH1 DToH2 |
|----------------------------------------------------|----------------------------------------------------------------------|
|----------------------------------------------------|----------------------------------------------------------------------|

Changing to asynchronous memcpys and using streams:

hipMemcpyAsync(d\_a1, h\_a1, Nbytes, hipMemcpyHostToDevice, stream1); hipMemcpyAsync(d\_a2, h\_a2, Nbytes, hipMemcpyHostToDevice, stream2); hipMemcpyAsync(d\_a3, h\_a3, Nbytes, hipMemcpyHostToDevice, stream3);

hipLaunchKernelGGL(myKernel1, blocks, threads, 0, stream1, N, d\_a1); hipLaunchKernelGGL(myKernel2, blocks, threads, 0, stream2, N, d\_a2); hipLaunchKernelGGL(myKernel3, blocks, threads, 0, stream3, N, d\_a3);

hipMemcpyAsync(h\_a1, d\_a1, Nbytes, hipMemcpyDeviceToHost, stream1); hipMemcpyAsync(h\_a2, d\_a2, Nbytes, hipMemcpyDeviceToHost, stream2); hipMemcpyAsync(h\_a3, d\_a3, Nbytes, hipMemcpyDeviceToHost, stream3);

| NULL Stream |       |               |               |               |       |  |
|-------------|-------|---------------|---------------|---------------|-------|--|
| Stream1     | HToD1 | myKernel<br>1 | DToH1         |               |       |  |
| Stream2     |       | HToD2         | myKernel<br>2 | DToH2         |       |  |
| Stream3     |       |               | HToD3         | myKernel<br>3 | DToH3 |  |

# AMD LINGO





| 1<br>2      | #include <cuda.h></cuda.h>                                                |
|-------------|---------------------------------------------------------------------------|
| 2<br>3<br>4 | constant <i>float</i> a = 2.0 <i>f</i> ;                                  |
| 5           | global                                                                    |
| 6           | <i>void</i> saxpy( <i>int</i> n, <i>float</i> const* x, <i>float</i> * y) |
| 7           | {                                                                         |
| 8           | int i = blockDim.x*blockIdx.x + threadIdx.x;                              |
| 9           | if (i < n)                                                                |
| 10          | y[i] += a*x[i];                                                           |
| 11          | }                                                                         |

- $\circ$  vector addition kernel in CUDA
- $\circ$  each thread takes one array index
- $\circ$  and performs one multiply-and-add operation

## SIMPLE SAXPY KERNEL

```
ADDING THE CPU CODE
    #include <cuda.h>
    \_constant_{float} a = 2.0f;
    __global__
    void saxpy(int n, float const* x, float* y)
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] += a*x[i];
    int main()
14
        int n = 256;
        std::size_t size = sizeof(float)*n;
        float* d_x;
        float* d_y;
        cudaMalloc(&d_x, size);
                                                                      allocate arrays in device memory
        cudaMalloc(&d_y, size);
22
        int num_blocks = 2;
                                                                      set up the grid
        int num_threads = 128;
24
                                                                      launch the kernel
        saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y); <-</pre>
        cudaDeviceSynchronize();
```



```
ADDING HOST↔ DEVICE COPIES
    #include <cuda.h>
    \_constant_{float} a = 2.0f;
    __global__
    void saxpy(int n, float const* x, float* y)
        int i = blockDim.x*blockIdx.x + threadIdx.x;
        if (i < n)
            y[i] += a*x[i];
    int main()
14
        int n = 256;
        std::size_t size = sizeof(float)*n;
        float* h_x = (float*)malloc(size);
                                                                     allocate arrays in host memory
        float* h_y = (float*)malloc(size);
        float* d_x;
        float* d_y;
        cudaMalloc(&d_x, size);
        cudaMalloc(&d_y, size);
24
        cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
                                                                     copy content to device memory
        cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
        int num_blocks = 2;
        int num_threads = 128;
        saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
                                                                     copy results back to host memory
        cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
        cudaDeviceSynchronize();
```

```
Public]
```

```
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a*x[i];
int main()
   std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   float* d x;
   float* d_y;
   cudaMalloc(&d_x, size);
   cudaMalloc(&d_y, size);
   cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
   cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
   int num_blocks = 2;
   int num threads = 128;
   saxpy << <num_blocks, num_threads>>>> (n, d_x, d_y);
   cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
   cudaDeviceSynchronize();
   cudaFree(d_x);
                                                                   free arrays in device memory
   cudaFree(d_y);
    free(h_x);
                                                                   free arrays in host memory
    free(h_y);
```

# ADDING MEMORY CLEANUP

```
#include <cassert>
\__constant__ float a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess) 
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
   float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

# **ADDING ERROR CHECKS**

simple error checking macro

#### simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

#### simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<<num_blocks, num_threads>>>(n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
   CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

#### same code in HIP

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == hipSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   assert(h_x != nullptr);
   assert(h_y != nullptr);
    float* d_x;
    float* d_y;
   CHECK(hipMalloc(&d_x, size));
    CHECK(hipMalloc(&d_y, size));
    CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
    CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy << <num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
   CHECK(hipDeviceSynchronize());
   CHECK(hipFree(d_x));
   CHECK(hipFree(d_y));
    free(h_x);
    free(h_y);
```

# spot the differences

#### simple CUDA code

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
   saxpy <<< num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

#### same code in HIP

```
#include <hip/hip_runtime.h>
#include <cassert>
\_constant_{float} a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
       y[i] += a * x[i];
#define CHECK(call) assert(call == hipSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
   assert(h_x != nullptr);
   assert(h_y != nullptr);
    float* d_x;
    float* d_y;
   CHECK(hipMalloc(&d_x, size));
   CHECK(hipMalloc(&d_y, size));
   CHECK(hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice));
   CHECK(hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy<<<num_blocks, num_threads>>>(n, d_x, d_y);
   CHECK(hipMemcpy(h_y, d_y, size, hipMemcpyDeviceToHost));
   CHECK(hipDeviceSynchronize());
   CHECK(hipFree(d_x));
   CHECK(hipFree(d_y));
    free(h_x);
    free(h_y);
```

#### Public

# **HIPIFY TOOLS**

#### hipify-clang

- compiler (clang) based translator
- handles very complex constructs
- $\circ$  prints an error if not able to translate
- $\circ$  supports clang options
- requires CUDA

## hipify-perl

- Perl<sup>®</sup> script
- $\circ~$  relies on regular expressions
- $\circ$  may struggle with complex constructs
- $\circ$  does not require CUDA

#### https://github.com/ROCm-Developer-Tools/HIPIFY

```
[Public]
```

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global_
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a * x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy <<< num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

saxpy\$ perl /opt/rocm/bin/hipify-perl -examin saxpy.cu

```
[HIPIFY] info: file 'saxpy.cu' statisitics:
  CONVERTED refs count: 13
  TOTAL lines of code: 46
  WARNINGS: 0
[HIPIFY] info: CONVERTED refs by names:
    cuda.h => hip/hip_runtime.h: 1
    cudaDeviceSynchronize => hipDeviceSynchronize: 1
    cudaFree => hipFree: 2
    cudaMalloc => hipMalloc: 2
    cudaMemcpy => hipMemcpy: 3
    cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
    cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 2
    cudaSuccess => hipSuccess: 1
    saxpy$
```

## hipify-perl

#### hipify-perl -examin

- for initial assessment
- o no replacements done
- prints basic statistics and the number of replacements

```
[Public]
```

```
#include <cassert>
\_constant_{float} a = 2.0f;
__global_
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
    std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
    float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy < < <num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
   CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    free(h_x);
    free(h_y);
```

```
saxpy$ perl /opt/rocm/bin/hipify-perl saxpy.cu
#include "hip/hip_runtime.h"
#include <hip/hip_runtime.h>
#include <cassert>
```

\_\_constant\_\_ float a = 2.0f;

```
__global__
void saxpy(int n, float const* x, float* y)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
}</pre>
```

#define CHECK(call) assert(call == hipSuccess)

```
int main()
{
    int n = 256;
    std::size_t size = sizeof(float)*n;
```

float\* h\_x = (float\*)malloc(size); float\* h\_y = (float\*)malloc(size); assert(h\_x != nullptr); assert(h\_y != nullptr);

float\* d\_x; float\* d\_y; CHECK(hipMalloc(&d\_x, size)); CHECK(hipMalloc(&d\_y, size));

CHECK(hipMemcpy(d\_x, h\_x, size, hipMemcpyHostToDevice)); CHECK(hipMemcpy(d\_y, h\_y, size, hipMemcpyHostToDevice));

int num\_blocks = 2; int num\_threads = 128; saxpy<<<<num\_blocks, num\_threads>>>(n, d\_x, d\_y);

CHECK(hipMemcpy(h\_y, d\_y, size, hipMemcpyDeviceToHost)); CHECK(hipDeviceSynchronize());

CHECK(hipFree(d\_x)); CHECK(hipFree(d\_y));

free(h\_x); free(h\_y);

saxpy\$

## hipify-perl

translating a file to standard output

#### but can also

- o translate in place
- preserve orig copy
- recursively do folders

```
[Public]
```

```
#include <cassert>
#include "cuda2hip.h" 🗲
__constant__ float a = 2.0f;
__global__
void saxpy(int n, float const* x, float* y)
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] += a*x[i];
#define CHECK(call) assert(call == cudaSuccess)
int main()
     std::size_t size = sizeof(float)*n;
    float* h_x = (float*)malloc(size);
    float* h_y = (float*)malloc(size);
    assert(h_x != nullptr);
    assert(h_y != nullptr);
    float* d_x;
     float* d_y;
    CHECK(cudaMalloc(&d_x, size));
    CHECK(cudaMalloc(&d_y, size));
    CHECK(cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice));
     CHECK(cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice));
    int num_blocks = 2;
    int num_threads = 128;
    saxpy << <num_blocks, num_threads>>> (n, d_x, d_y);
    CHECK(cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
     free(h_x);
    free(h_y);
```

|   | #define | cudaSuccess            | hipSuccess           |
|---|---------|------------------------|----------------------|
| 2 | #define | cudaMalloc             | hipMalloc            |
| 3 | #define | cudaMemcpy             | hipMemcpy            |
|   | #define | cudaMemcpyHostToDevice | hipMemcpyHostToDevic |
| 5 | #define | cudaMemcpyDeviceToHost | hipMemcpyDeviceToHos |
| 6 | #define | cudaDeviceSynchronize  | hipDeviceSynchronize |
| 7 | #define | cudaFree               | hipFree              |
| 8 |         |                        |                      |

#### alternatively

- create a file with renaming macros
- include conditionally, depending on target

# **OPTIMIZATION TECHNIQUES**

#### basic

- thread divergence / SIMDzation
- reuse in shared memory & bank conflicts
- coalescing of global memory accesses
- resource partitioning / occupancy / spills
- $\circ~$  L1, L2 cache blocking

0 ...

#### advanced

- $\circ$  atomics
- warp primitives
- CPU-GPU coherence
- $\circ$  inter-stream synchronization
- 0 ...

#### Public

# **DIFFERENCES FROM CUDA**

- o warpSize
  - $\circ$  64 on AMD
  - o 32 on NVIDIA
- dynamic parallelism not supported
- $\circ$  exercise caution:
  - $\circ$  atomics
  - managed memory
  - warp-level primitives
  - inter-process communication

# AMD RESOURCES DOCUMENTATION AND TRAINING



## AMD ROCM DEVELOPER HUB

## Engage with ROCm Experts

Participate in ROCm Webinar Series Post questions, view FAQ's in Community Forum

#### Increase Understanding

Purchase ROCm Text Book View the latest news in the Blogs

## Get Started Using ROCm

ROCm Documentation on GitHub Download the Latest Version of ROCm

https://www.amd.com/en/developer/rocm-hub.html

Intro to ROCm Start Using ROCm Community Infinity Hub Blogs

AMD ROCm<sup>™</sup> is an open software platform for accelerated compute offering a code once, use everywhere approach. Access all ROCm developer resources here — from documentation, to training webinars, to the latest blogs, and more.

#### Get to Know ROCm™



 Training Webinars

 Register for an upcoming ROCm training webinar or view previous webinars on-demand.

 Register Now I2



ROCm Textbook
Learn about using ROCm with the "Accelerated Computing with
HIP" textbook.
Get Now P?



ROCm Docs Find the latest documentation on all ROCm releases.

See Docs





Ask the ROCm Community

This community is for ROCm users to come together to learn, share experiences, and help solve issues using the ROCm platform.

Learn More 🖸

## **NEW ROCM DOCS**

## Comprehensive Coverage

Compilers and Frameworks Math libraries, communication libraries Management tools, validation tools

Howto Guides

Installation Tunning Debugging

000000000000

https://rocm.docs.amd.com/

| GitHub                                                                                                                                                                                  | Community         | AMD Lab Notes                                                                                                                                                                   | Infinity Hub                                                     | Support     | Feedback                                                                                              |                                                                  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|-------------|-------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| ROCm Docume<br>Home<br>What is ROCm?<br>Deploy ROCm                                                                                                                                     | entation          | <pre> AMD RO Applies to Linux E </pre>                                                                                                                                          | Cm™ D<br>₃ 2023-05-25 ⊙                                          |             | ntation                                                                                               | Ø                                                                |
| Linux Quick Start<br>Linux Overview<br>Docker                                                                                                                                           | ~                 | What is ROCm?                                                                                                                                                                   | ~                                                                | Deploy ROCm | ~                                                                                                     | Release Info ~                                                   |
| Release Info<br>Release Notes<br>Changelog<br>GPU Support and OS<br>(Linux)<br>Known Issues 2<br>Compatibility<br>Licensing Terms<br>APIs and Reference<br>All Reference Materia<br>HIP | e                 | APIs and Reference<br>Compilers and<br>HIP<br>OpenMP<br>Math Libraries<br>C++ Primitives<br>Communication<br>AI Libraries<br>Computer Visio<br>Management T<br>Validation Tools | Development Tools<br>Libraries<br>n Libraries<br>n<br>ools       |             | Understand ROC<br>• Compiler Dis<br>• Using CMak<br>• Linux Folder<br>• GPU Isolatic<br>• GPU Archite | sambiguation<br>e<br>: Structure Reorganization<br>on Techniques |
| Math Libraries<br>C++ Primitive Librarie<br>Communication Libra<br>Al Libraries<br>Computer Vision<br>OpenMP<br>Compilers and Tools<br>Management Tools<br>Validation Tools             | aries ×<br>×<br>× | GPU Aware MF                                                                                                                                                                    | Deep Learning with<br>stallation<br>stallation<br>v Installation |             |                                                                                                       | ·                                                                |
| Understand ROCm<br>All Explanation Mater<br>Compiler Disambigue                                                                                                                         | rial              |                                                                                                                                                                                 |                                                                  |             |                                                                                                       | Ne<br>What is ROCm                                               |

together we advance\_

AMDA ROCm<sup>™</sup> Platform 5.6.0

## HIP TEXTBOOK

## Comprehensive Coverage

HIP Language AMD GPU Internals Performance Analysis Debugging Programming Patterns ROCm Libraries Porting to HIP Multi-GPU Programming Third Party Tools CDNA Assembly



https://www.barnesandnoble.com/w/accelerated-computing-with-hip-yifan-sun/1142866934

## AMD INFINITY HUB

#### AMD Instinct<sup>™</sup> MI200 SUPPORT

29 key applications & frameworks on Infinity Hub & a catalogue supporting over 90 applications, frameworks & tools

### Accelerating Instinct<sup>™</sup> adoption

Over 17000 application pulls. 10000+ since last year

### PERFORMANCE RESULTS

Published Performance Results for Select Apps / Benchmarks

https://www.amd.com/en/technologies/infinity-hub

#### AMD Infinity Hub



## SOFTWARE CATALOG

STRONG MOMENTUM AND INCREASING LIST OF SUPORTED APPLICATION, LIBRARIES & FRAMEWORKS

| Life Science                                             | Physics                                                                         | Chemistry                                             | CFD                                                                                              | Earth Science                                                                                      |
|----------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| AMBER<br>GROMACS<br>NAMD<br>LAMMPS<br>Hoomd-Blue<br>VASP | MILC<br>GRID<br>QUANTUM ESPRESSO<br>N-Body<br>CHROMA<br>PIConGPU<br>QuickSilver | CP2K<br>QUDA<br>NWCHEM<br>TERACHEM<br>QMCPACK         | OpenFOAM®<br>AMR-WIND<br>NEKBONE<br>LAGHOS<br>NEKO<br>NEKRS<br>PeleC                             | EXAGO<br>DEVITO<br>OCCA<br>SPECFEM3D-GLOBE<br>SPECFEM3D-CARTESIAN<br>ACECAST (WRF)<br>MPAS<br>ICON |
| Benchmarks                                               | Libraries                                                                       | ML Frameworks                                         | ISV Applications                                                                                 |                                                                                                    |
| HPL<br>HPCG<br>AMG<br>ML - TORCHBENCH<br>ML - SUPERBENCH | AMR-EX<br>Ginkko<br>HYPRE<br>TRILINOS                                           | PYTORCH<br>TENSORFLOW<br>JAX<br>ONNX<br>OPENAI TRITON | ANSYS MECHANICAL<br>CADENCE CHARLES<br>ANSYS FLUENT*<br>SIEMENS® STAR-CCM+*<br>SIEMENS® CALIBRE* | + MANY MORE                                                                                        |

## AMD LAB NOTES



Search GPUOpen...

Q

知乎 🕑 🎽 🕲 🔊

HOME SOFT

SOFTWARE V DOCS V

 Image: General control of the second seco

Home » Blogs » AMD lab notes » AMD matrix cores



Finite difference method - Laplacian part 1

• Finite difference method - Laplacian part 2

• Finite difference method - Laplacian part 3

Finite difference method - Laplacian part 4

Example 1 - V\_MFMA\_F32\_16x16x4F32 Example 2 - V\_MFMA\_F32\_16x16x1F32 Example 3 - V\_MFMA\_F64\_4x4x4F64 A note on rocWMMA

Introduction to profiling tools for AMD

AMD Instinct<sup>™</sup> MI200 GPU memory space

Gegister pressure in AMD CDNA™2 GPUs
 GPU-aware MPI with ROCm

A note on the AMD Matrix Instruction Calculator ...

AMD lab notes

AMD matrix cores

References

overview

Search this manual

Using AMD matrix cores MFMA compiler intrinsic syntax

#### AMD matrix cores 🛷

Matrix multiplication is a fundamental aspect of Linear Algebra and it is an ubiquitous computation within High Performance Computing (HPC) Applications. Since the introduction of AMD's CDNA Architecture, Generalized Matrix Multiplication (GEMM) computations are now hardware-accelerated through Matrix Core Processing Units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS but they can also be programmed directly by developers. Applications that are throughput bound by GEMM computation can achieve additional speedups by utilizing Matrix Cores.

AMD's Matrix Core technology supports a full range of mixed precision operations bringing us the ability to work with large models and enhance memory-bound operation performance for any combination of AI and machine learning workloads. The various numerical formats have uses in different applications. Examples include use of 8-bit integers (INT8) for ML inference, 32-bit floating point (FP32) data for ML Training and HPC applications, 16-bit floating point (FP16) data for graphics workloads and 16-bit brain float (BF16) data for ML training with fewer convergence issues.

To learn more about the theoretical speedups achievable by using matrix cores compared to SIMD Vector Units, please refer to the tables below. The tables list the performance of the Vector (i.e. Fused Multiply-Add or FMA) and Matrix core units of the previous generation (MI100) and current generation (MI250X) of CDNA Accelerators.

Matrix Core Performance for MI100 and MI250X:

| Data format | MI100 Flops/Clock/CU | MI250X Flops/Clock/CU |  |  |
|-------------|----------------------|-----------------------|--|--|
| FP64        | N/A                  | 256                   |  |  |
| FP32        | 256                  | 256                   |  |  |
| FP16        | 1024                 | 1024                  |  |  |
| BF16        | 512                  | 1024                  |  |  |
| INT8        | 1024                 | 1024                  |  |  |

Vector (FMA) Unit Performance for MI100 and MI250X:

| Data format | MI100 Flops/Clock/CU | MI250X Flops/Clock/CU |  |
|-------------|----------------------|-----------------------|--|
| FP64        | 64                   | 128                   |  |
| FP32        | 128                  | 128                   |  |

#### **Introductory Topics**

ROCm installation Basics of HIP programming

## Advanced Topics

Matrix Cores Register pressure GPU-aware MPI

https://gpuopen.com/learn/amd-lab-notes/ https://github.com/AMD/amd-lab-notes

> AMD together we advance\_

# DISCLAIMERS

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

© 2023 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD CDNA, AMD ROCm, AMD Instinct, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

# ATTRIBUTIONS

Docker and the Docker logo are trademarks or registered trademarks of Docker, Inc.

Git and the Git logo are either registered trademarks or trademarks of Software Freedom Conservancy, Inc., corporate home of the Git Project, in the United States and/or other countries.

Intel is a trademark of Intel Corporation or its subsidiaries.

Kubernetes is a registered trademark of The Linux Foundation.

NAMD was developed by the Theoretical Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/namd/

OpenCL is a trademark of Apple Inc. used by permission by Khronos Group, Inc.

OpenFOAM is a registered trademark of OpenCFD Limited, producer and distributor of the OpenFOAM software via www.openfoam.com.

The OpenMP name and the OpenMP logo are registered trademarks of the OpenMP Architecture Review Board.

Perl is a trademark of Perl Foundation.

Siemens is a registered trademark of Siemens Product Lifecycle Management Software Inc., or its subsidiaries or affiliates, in the United States and in other countries.

#