ONETEP using GPUs

Author:

Jacek Dziedzic, University of Southampton

Introduction

Starting from v7.3, an OpenACC-based GPU port of ONETEP is available. At present this is a preliminary implementation, with four key algorithms having been ported to the GPU.

These are:

  • fast density,

  • fast local potential integrals,

  • Hartree-Fock exchange,

  • sparse matrix products.

Note that the usual (“slow”) calculation of density and of local potential integrals is not GPU-capable. You will not get any improvement from using GPUs unless you switch to fast density and fast local potential integrals, or unless you use Hartree-Fock exchange. The improvement in sparse matrix products is modest.
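
If you are unsure how to switch these on, they are controlled by keywords in the input file. A minimal fragment might look like the sketch below; the keyword names are given here as assumptions and should be checked against the fast density and fast local potential integrals documentation.

    # Assumed keyword names -- verify against the fast density and
    # fast local potential integrals documentation before use.
    fast_density T        # assumed: enable the fast density calculation
    fast_locpot_int T     # assumed: enable fast local potential integrals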

Implementation

ONETEP uses OpenACC for offloading compute-intensive parts of the calculation to GPUs. You will need an OpenACC-capable Fortran compiler to be able to use it. At the time of writing (2025.04), there are three options:

  1. nvfortran (versions 24.x or 25.x are recommended),

  2. Cray Fortran,

  3. gfortran (suitably compiled).

Only option 1. has been tested thoroughly. Cray Fortran can be used on Archer2, but you are likely to run into problems due to lack of testing of this configuration. Option 3. has not been tested at all, and obtaining an OpenACC-capable gfortran compiler is in itself a daunting task. We recommend, and support, only option 1.

CUDA Fortran is not required for ONETEP, although CUDA libraries (cuFFT, cuBLAS) are. These will be provided with your nvfortran (nvhpc) installation.

Compilation and linking

A number of flags must be passed to the compiler to enable GPU support. We highly recommend using one of the config files provided in the config directory as a template.

Good choices are:
  • conf.iridisx.nvfortran.omp.scalapack.acc,

  • conf.jureca.nvfortran.omp.scalapack.acc,

  • conf.RH9.nvfortran.acc.

The first two use MKL for FFTW, BLAS, LAPACK and ScaLAPACK. The last uses system-wide FFTW, and the BLAS, LAPACK and ScaLAPACK shipped with nvhpc.

If you prefer to build your own compilation line from scratch (not recommended), here are the required flags (an illustrative compile line is given after the full list):

  • -acc to enable OpenACC in nvfortran,

  • -cudalib to link against CUDA libraries,

  • -lcufft -lcublas to link against cuFFT and cuBLAS,

  • -DGPU_ACC to tell ONETEP to use GPU acceleration.

The following is highly recommended:

  • -DGPU_FFT_CUFFT_ACC to use GPU-accelerated FFTs via cuFFT.

The following is recommended if you have an A100 card:

  • -DGPU_A100 to use an alternative approach in HFx on A100 cards.

The following is recommended if you have much more GPU power than CPU power. For instance, with 4 H100 cards on a 64-core node you would definitely want this, but with 4 A100 cards on a 128-core node you would not. Either way, the difference will not be substantial, so do not worry if you are unsure.

  • -DGPU_DGEMM to use GPUs for sparse matrix products.
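
For illustration, a hand-rolled compile-and-link line built from the flags above might look roughly like the sketch below. The optimisation level, the -mp OpenMP flag, the object list and the choice of invoking nvfortran directly (rather than through an MPI wrapper such as mpif90) are assumptions to be adapted to your system; the provided config files remain the recommended route.

    # Sketch only: object files, paths and optimisation flags are placeholders.
    # -cudalib=cufft,cublas links cuFFT and cuBLAS (equivalently, -lcufft -lcublas
    # with the appropriate library paths).
    nvfortran -O2 -mp -acc -cudalib=cufft,cublas \
              -DGPU_ACC -DGPU_FFT_CUFFT_ACC \
              -o onetep.gpu <object files>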

All the options are described in detail later in this document.

Once again, it’s much easier to start off with one of the provided config files. Take care to adjust the paths (MKLMODPATH, LIBCUFFTPATH, LIBCUDAPATH, LIBNVSHMEMPATH – if present).
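
For instance, the path variables in your copy of the config file might be adjusted along the following lines. The directories are placeholders for your MKL and NVIDIA HPC SDK installation, and the exact assignment syntax should follow the template you copied.

    # Placeholders -- point these at your own installation.
    MKLMODPATH     = /path/to/mkl/module/files
    LIBCUFFTPATH   = /path/to/nvhpc/math_libs/lib64
    LIBCUDAPATH    = /path/to/nvhpc/cuda/lib64
    LIBNVSHMEMPATH = /path/to/nvshmem/lib    # only if present in your template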

Running jobs

Easy-peasy! onetep_launcher now supports GPUs and will happily set everything up for you. Just make sure to follow the documentation of your HPC system to ensure you ask for GPUs in your submission script. This is typically done by passing --gres=gpu:n to ask for n GPUs on each node. Some HPC systems (cough, JURECA, cough) instead prefer that you do not do this. You might also want to ensure you get exclusive access to the nodes. Depending on your HPC system, this may or may not happen automatically.

Then, pass the options -G -g n to onetep_launcher, where n is the number of GPUs on a node. The option -g n tells onetep_launcher that you would like to use n GPUs on each node. The option -G tells it to start MPS (multi-process service) for you. This saves you a lot of hassle – a single instance of MPS needs to be started on each node, and onetep_launcher knows how to do this. It will also verify that this worked and abort if it did not. If, for some reason, you prefer not to use MPS, do not pass -G. In typical scenarios using MPS gives a modest (10-20%) boost to GPU performance, but your mileage may vary. In particular, if you have a 1:1 mapping between MPI ranks and GPUs, you might not want to use MPS. To start and stop MPS, onetep_launcher uses an auxiliary script called gpu_mps_control, provided in the utils directory.
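
To make this concrete, a minimal Slurm-style fragment for nodes with 4 GPUs might look like the sketch below. The scheduler directives, the rank and thread counts, and the -e option naming the ONETEP binary are assumptions to be adapted to your system and to the usual onetep_launcher invocation there; only -G and -g n come from the description above.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=16     # MPI ranks per node (example value)
    #SBATCH --cpus-per-task=4        # OpenMP threads per rank (example value)
    #SBATCH --gres=gpu:4             # ask for 4 GPUs on each node
    #SBATCH --exclusive              # exclusive node access, if not automatic

    # -g 4: use 4 GPUs on each node; -G: start (and verify) MPS on each node.
    # -e (assumed) names the ONETEP binary; -t sets the OpenMP thread count.
    srun onetep_launcher -e ./onetep.gpu -t 4 -G -g 4 mycalc.dat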

Starting from v7.3.68, onetep_launcher’s resmon also supports GPUs. That is, if you pass the -r k option to onetep_launcher to ask for resources to be monitored every k seconds, onetep_launcher will run gpu_top (provided in the utils directory) in addition to top. Results go into a directory called resmon. This lets you easily monitor the utilisation of the GPU(s) and GPU memory use. GPU out-of-memory (OOM) scenarios are difficult to debug, and this can help.
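
For example, adding -r 30 to the launcher line from the previous sketch records these snapshots every 30 seconds (same assumptions as before):

    # top and gpu_top output is collected every 30 seconds into a directory
    # called resmon, covering CPU and GPU utilisation and GPU memory use.
    srun onetep_launcher -e ./onetep.gpu -t 4 -G -g 4 -r 30 mycalc.dat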

Control over GPU acceleration

The following compile-time options are recognized by the GPU port.

  • -DGPU_ACC – Enables OpenACC, and thus the GPU port. Required.

  • -DGPU_FFT_CUFFT_ACC – Uses GPUs (via cuFFT, controlled through OpenACC) for Fourier interpolation and Fourier filtering in fast density and fast local potential integrals. Does not affect the rest of ONETEP. By moving adjacent operations to the GPU, host-to-device copyin and device-to-host copyout can be avoided. Highly recommended.

  • -DGPU_FFT_CUFFT_CUDA – An alternative to -DGPU_FFT_CUFFT_ACC which uses a CUDA (rather than an OpenACC) backend to cuFFT. In some scenarios it can offer a modest advantage over -DGPU_FFT_CUFFT_ACC, but it has not been tested as thoroughly. Requires CUDA Fortran. Prefer -DGPU_FFT_CUFFT_ACC.

  • -DGPU_DGEMM – Moves DGEMM() operations in sparse_product() to cuBLAS and reduces the default value of dense_threshold from 0.10 to 0.05. This is a naive approach to porting sparse matrix multiplications to GPUs: any time dense blocks larger than 256x256 are multiplied, the multiplication is offloaded to the GPU(s). This is only modestly faster than CPU BLAS because of copyin and copyout; the associated reduction in dense_threshold helps move more matmuls to the GPU. If your nodes have a lot of CPU power compared to GPU power, this might not be advantageous at all. However, if you have a lot of GPU power, this can speed up sparse algebra substantially. For instance, with H100 cards it is likely to help. With A100 cards it may help if you do not have very many cores on a node (e.g. 64 or 48). On a desktop machine with ~10 cores, it is likely to help even with a less powerful GPU. You might want to measure with and without this setting to see if it helps with performance on your machine. If you find good speed-ups, you may consider reducing dense_threshold further, even to 0.0.

  • -DGPU_A100 – Forces the GPU port of Hartree-Fock exchange to use an alternative parallelisation scheme, suitable for A100 cards. Use this if you have A100 GPUs or if you experience deadlocks when using HFx on GPUs. Otherwise, you probably do not need this, although it may be beneficial to check whether it improves performance. Ignored outside of Hartree-Fock exchange.

There is also one runtime option (specified in the input file) that controls the GPU port:

  • threads_gpu – can be used to adjust the number of OpenMP threads in loops involving FFTs on the GPU(s). This defaults to threads_max (which is the number of OpenMP threads for most of ONETEP – set with onetep_launcher’s -t option or via OMP_NUM_THREADS). However, if this value is large (e.g. 16 or more), it can put a lot of strain on GPU memory. If you find that you run out of GPU memory, reduce this value (e.g. to 4, or as a last resort, to 1). This will vastly reduce the requirement on GPU memory while the reduction in performance should not be dramatic – FFTs are expensive, and even with fewer threads it is often possible to saturate the GPU.
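
For example, to keep most of ONETEP on 16 OpenMP threads while limiting the loops involving GPU FFTs to 4 threads, the input file could contain something along these lines (the values are illustrative):

    threads_max 16    # OpenMP threads for most of ONETEP
    threads_gpu 4     # fewer threads in loops involving GPU FFTs, easing GPU memory pressure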

Hartree-Fock exchange

This has been ported only partially; work is in progress to complete the port. The largest speed-ups will be seen for relatively small systems (under 200 atoms). For best speed-up, set the following (a combined example is given after this list):

  • hfx_memory_limit -1 to turn off automatic memory management,

  • cache_limit_for_swops 0 to not waste memory on caching spherical wave potentials (“SWOPs”), as they will be recalculated efficiently on the GPU(s),

  • cache_limit_for_expansions n to instead use n MB of RAM per MPI rank for caching expansions (of NGWF pairs in spherical waves), where you should choose n to be as large as possible without exceeding your RAM allowance.
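
Put together, the corresponding input-file fragment could look as follows; the value given to cache_limit_for_expansions is purely illustrative and should be chosen to fit your RAM allowance.

    hfx_memory_limit -1              # turn off automatic memory management
    cache_limit_for_swops 0          # do not cache SWOPs; recalculate them on the GPU(s)
    cache_limit_for_expansions 8000  # MB of RAM per MPI rank for caching expansions (illustrative)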

All the usual guidelines in Advanced options still apply.

State of the art

This is a preliminary implementation. It has been tuned for single-node performance and is unlikely to scale well beyond a few nodes. Work is in progress to address that. Much of the time is still spent on the CPUs; we are working on that too.