

**University of Stuttgart** Institute of Aerodynamics and Gas Dynamics





#### LUMI Hackathon: Porting GALÆXI To AMD Accelerators







https://numericsresearchgroup.org



https://github.com/flexiframework/galaexi





- Member of the FLEXI code framework, with a growing feature set
- Developed by the Numerics Research Group, IAG, University of Stuttgart, Germany
- Open-source HPC solver for the unsteady compressible Navier–Stokes equations
- High-order Discontinuous Galerkin Spectral Element Method (DGSEM)
- Current feature set
  - Supported: NVIDIA (CUDA Fortran)
  - Running CI/CD pipeline
  - Gauss / Gauss-Lobatto points
  - Variable polynomial degree
  - Several Riemann solvers (compile flag)
  - Split DG (several types, compile flag)
  - Boundary conditions
  - MPI parallelized
  - Shock capturing
  - Sponge zones
  - Test cases (TGV, channel, ...)
  - ...



## GALÆXI: DGSEM

Semi-discrete formulation of the DGSEM (weak form):
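
For orientation, a generic weak form of the kind that DGSEM discretizes can be sketched as below. The notation is an assumption and is not taken from the slides: $U$ is the state vector, $\mathcal{F}$ the flux transformed to the reference element $E = [-1,1]^3$, $J$ the Jacobian of the element mapping, $\phi$ a polynomial test function, and $(\mathcal{F}\cdot\hat{n})^{*}$ a numerical (Riemann solver) flux on the element surface.

$$
\int_E J \,\frac{\partial U}{\partial t}\,\phi \,\mathrm{d}\xi
+ \oint_{\partial E} (\mathcal{F}\cdot\hat{n})^{*}\,\phi \,\mathrm{d}S
- \int_E \mathcal{F}(U)\cdot\nabla_{\xi}\phi \,\mathrm{d}\xi = 0
$$

The semi-discrete scheme follows by approximating $U$ and $\mathcal{F}$ with tensor-product Lagrange polynomials and evaluating the integrals with Gauss or Gauss-Lobatto quadrature, which leaves a system of ordinary differential equations in time for the nodal values.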



## **GALÆXI v1.0: Implementation**

- CUDA Fortran
- Design principles
  1. Retain the general data structure and parallelization strategy of FLEXI for unstructured meshes
  2. Retain the majority of the codebase and the associated features of the original implementation
  3. Execute all routines called during time-stepping on the accelerator, without the need for data transfers
- Inter-GPU parallelization: distributed-memory approach using GPU-aware MPI and systematic hiding of communication latency
- Intra-GPU parallelization: launch multiple kernels concurrently and asynchronously to *maximize the occupancy*, e.g. using *streams*
- Kernel implementation: maximize the utilization of the available parallel resources, with individual parallelization strategies for *pointwise* and *volume/surface* operations



### **GALÆXI v1.0: Performance Evaluation**

- Performance index: $\mathrm{PID} = \frac{\text{Walltime} \times \#\text{Ranks}}{\#\text{RK-stages} \times \#\text{DOF}}$
- PID describes the walltime required by a single rank to advance a single DOF for one stage of the explicit Runge–Kutta time-stepping
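
As a worked illustration of the formula above, the following C sketch evaluates the PID from measured quantities. The function names are hypothetical and not part of GALÆXI. Two assumptions are made: `rk_stages` counts all Runge–Kutta stages executed during the measured walltime, and the DOF count is taken per variable as $(N+1)^3$ nodes per hexahedral element.

```c
/* Degrees of freedom per variable for nElems hexahedral elements of
 * polynomial degree N, assuming (N+1)^3 nodes per element (3D). */
double dof_count(long nElems, int N)
{
    double nodes1d = (double)(N + 1);
    return (double)nElems * nodes1d * nodes1d * nodes1d;
}

/* PID = walltime * ranks / (RK stages * DOF).
 * Unit: seconds per rank, per DOF, per RK stage.
 * rk_stages is assumed to be the total number of stages executed. */
double performance_index(double walltime_s, long ranks,
                         long rk_stages, double dofs)
{
    return (walltime_s * (double)ranks) / ((double)rk_stages * dofs);
}
```

For example, 100 s of walltime on 4 ranks for 1000 stages on $10^6$ DOF gives a PID of $4 \times 10^{-7}$ s. A lower PID means a faster code for the same problem size.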



# GALÆXI

## GALÆXI v2.0

- Retain design principles from GALÆXI v1.0
- Allow computation on ALL architectures (including CPU)
  - Rewrite all CUDA Fortran kernels in C.
- Governing philosophy of multi-backend approach
  - Fortran = host code
  - C = device code
- Host code (Fortran) calls device code (C) through interfaces
  - Interfaces wrap CUDA/HIP API calls and Kernels
  - Hides GPU code for those who don't want/need to interact with it
  - Easily extendable to multiple backends/vendors

```fortran
ALLOCATE(U(PP_nVar,0:PP_N,0:PP_N,0:PP_NZ,nElems))
CALL AllocateDeviceMemory("d_U", SIZE_C_DOUBLE, SIZE(U))
CALL CopyToDevice("d_U", U, SIZE(U))
CALL CopyFromDevice(U, "d_U", SIZE(U))
```
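
Because the host array is allocated in Fortran, device code written in C sees it as a flat buffer in column-major order. The sketch below shows one way to compute the linear offset of `U(iVar,i,j,k,iElem)` for the array shape used above. The helper name `u_offset` is hypothetical, and the actual GALÆXI kernels may index differently.

```c
#include <stddef.h>

/* Zero-based linear offset of U(iVar,i,j,k,iElem) in Fortran (column-major)
 * storage for an array declared as U(nVar,0:N,0:N,0:NZ,nElems).
 * iVar and iElem are 1-based; i, j, k are 0-based, as on the Fortran side. */
size_t u_offset(int iVar, int i, int j, int k, int iElem,
                int nVar, int N, int NZ)
{
    size_t n1 = (size_t)(N + 1);   /* extent of the i and j dimensions */
    size_t nz = (size_t)(NZ + 1);  /* extent of the k dimension */

    /* Horner-style accumulation from the slowest to the fastest index. */
    size_t off = (size_t)(iElem - 1);
    off = off * nz + (size_t)k;
    off = off * n1 + (size_t)j;
    off = off * n1 + (size_t)i;
    off = off * (size_t)nVar + (size_t)(iVar - 1);
    return off;
}
```

With `nVar = 5` and `N = NZ = 3`, each element occupies 320 consecutive entries, so the first entry of the second element sits at offset 320 and the variable index is the fastest-running one.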

```c
void AllocateDeviceMemory(char* dVarKey, size_t typeSize_bytes, int arraySize)
{
    void* d_arr;
#if (USE_ACCEL == ACCEL_CUDA)
    DEVICE_ERR_CHECK( cudaMalloc(&d_arr, typeSize_bytes*arraySize) );
#elif (USE_ACCEL == ACCEL_HIP)
    DEVICE_ERR_CHECK( hipMalloc(&d_arr, typeSize_bytes*arraySize) );
#elif (USE_ACCEL == ACCEL_HYBRID)
    d_arr = NULL;
#endif
    DeviceVars[dVarKey] = d_arr;
}
```
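
The key-based lookup behind these interfaces can be illustrated with a host-only sketch. It is not the GALÆXI implementation: all names (`sketch_allocate`, `sketch_copy_to`, `sketch_copy_from`) are hypothetical, `malloc` and `memcpy` stand in for the CUDA/HIP allocation and copy calls, and a small fixed-size table stands in for whatever container `DeviceVars` actually is.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_VARS 64

/* One registry entry: a string key and the buffer it refers to. */
typedef struct {
    char  *key;       /* private copy of the key string */
    void  *ptr;       /* stand-in for a device pointer */
    size_t elemSize;  /* size of one element in bytes */
    size_t count;     /* number of elements */
} VarEntry;

static VarEntry registry[MAX_VARS];
static int      nVars = 0;

/* Return the entry registered under key, or NULL if there is none. */
static VarEntry *find_var(const char *key)
{
    for (int n = 0; n < nVars; ++n)
        if (strcmp(registry[n].key, key) == 0) return &registry[n];
    return NULL;
}

/* Allocate a buffer and register it under key. Returns 0 on success. */
int sketch_allocate(const char *key, size_t typeSize_bytes, size_t arraySize)
{
    if (nVars >= MAX_VARS || find_var(key) != NULL) return -1;

    char *keyCopy = malloc(strlen(key) + 1);
    void *buf     = malloc(typeSize_bytes * arraySize); /* cf. cudaMalloc/hipMalloc */
    if (keyCopy == NULL || buf == NULL) { free(keyCopy); free(buf); return -1; }
    strcpy(keyCopy, key);

    registry[nVars].key      = keyCopy;
    registry[nVars].ptr      = buf;
    registry[nVars].elemSize = typeSize_bytes;
    registry[nVars].count    = arraySize;
    ++nVars;
    return 0;
}

/* Copy count elements from host memory into the buffer named by key. */
int sketch_copy_to(const char *key, const void *host, size_t count)
{
    VarEntry *e = find_var(key);
    if (e == NULL || count > e->count) return -1;
    memcpy(e->ptr, host, count * e->elemSize);  /* cf. a host-to-device copy */
    return 0;
}

/* Copy count elements from the buffer named by key back to host memory. */
int sketch_copy_from(void *host, const char *key, size_t count)
{
    VarEntry *e = find_var(key);
    if (e == NULL || count > e->count) return -1;
    memcpy(host, e->ptr, count * e->elemSize);  /* cf. a device-to-host copy */
    return 0;
}
```

In this sketch the caller never handles a raw device pointer. It refers to `"d_U"` by name, mirroring the Fortran calls above, so swapping the backend only touches the body of each wrapper.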

## **GALÆXI v2.0: Development Plan**

- Phase 0 (Finished)
  - Updates to the supporting infrastructure around the code (CI/CD, unit testing, build, etc.)
  - Incorporated device memory management methods
- Phase 1 (in progress, due 13.12.2024)
  - Port core kernels from CUDA Fortran to CUDA/HIP C++.
  - Code able to run inviscid, incompressible test problem on a single GPU (all architectures)
- Current Status
  - Core kernels rewritten in C
  - Computation of inviscid test problem is possible on a single NVIDIA GPU with new kernels
  - No support for AMD
- Phase 2 and Phase 3 (Planned)
  - Reintroduce multi-GPU support
  - Support for more physics and features (viscous flow, shock capturing, WMLES, etc.)

## **LUMI Hackathon Goals**

- Easy
  - Extend device support code (memory management, kernel launches, etc.) to support HIP
  - Build with HIP compiler
  - Familiarize with AMD profiling tools
- Better
  - Complete adjustments to kernels (if needed)
  - Code can run test problem on LUMI-G nodes
- Best
  - Tuning and optimization
  - Start looking at hybrid architectures



## **THANK YOU!**





