=================
ONETEP using GPUs
=================

:Author: Jacek Dziedzic, University of Southampton

Introduction
============

Starting from v7.3, an OpenACC-based GPU port of ONETEP is available. At this
moment this is a preliminary implementation, in which four key algorithms have
been ported to the GPU. These are:

- fast density (see :ref:`user_fast_density`),
- fast local potential integrals (see :ref:`user_fast_locpot_int`),
- Hartree-Fock exchange,
- sparse matrix products.

Note that the usual ("slow") calculations of the density and of local potential
integrals are not GPU-capable. Unless you use Hartree-Fock exchange, you will
not see any improvement from using GPUs without switching to fast density and
fast local potential integrals. The improvement in sparse matrix products is
modest.

Implementation
==============

ONETEP uses OpenACC for offloading compute-intensive parts of the calculation
to GPUs. You will need an OpenACC-capable Fortran compiler to be able to use
that. At the time of writing (2025.04), there are three options:

1. nvfortran (versions 24.x or 25.x are recommended),
2. Cray Fortran,
3. gfortran (suitably compiled).

Only option 1 has been tested thoroughly. Cray Fortran can be used on Archer2,
but you are likely to run into problems due to the lack of testing of this
configuration. Option 3 has not been tested at all, and obtaining an
OpenACC-capable gfortran compiler is in itself a daunting task. We recommend,
and support, only option 1.

CUDA Fortran is not required for ONETEP, although CUDA libraries (cuFFT,
cuBLAS) are. These will be provided with your nvfortran (``nvhpc``)
installation.

Compilation and linking
=======================

A number of flags must be passed to the compiler to enable GPU support. We
highly recommend using one of the config files provided in ``config`` as a
template. Good choices are:

- ``conf.iridisx.nvfortran.omp.scalapack.acc``,
- ``conf.jureca.nvfortran.omp.scalapack.acc``,
- ``conf.RH9.nvfortran.acc``.

The first two use MKL for FFTW, BLAS, LAPACK and ScaLAPACK. The last uses a
system-wide FFTW, and the BLAS, LAPACK and ScaLAPACK shipped with ``nvhpc``.

If you prefer to build your own compilation line from scratch (not
recommended), here are the required flags:

- ``-acc`` to enable OpenACC in nvfortran,
- ``-cudalibs`` to link against CUDA libraries,
- ``-lcufft -lcublas`` to link against cuFFT and cuBLAS,
- ``-DGPU_ACC`` to tell ONETEP to use GPU acceleration.

The following is highly recommended:

- ``-DGPU_FFT_CUFFT_ACC`` to use GPU-accelerated FFTs via cuFFT.

The following is recommended if you have an A100 card:

- ``-DGPU_A100`` to use an alternative approach in HFx on A100 cards.

The following is recommended if you have much more GPU power than CPU power.
For instance, with 4 H100 cards on a 64-core node you would definitely want
it, but with 4 A100 cards on a 128-core node you wouldn't. Either way, the
difference will not be substantial, so do not worry too much about this choice.

- ``-DGPU_DGEMM`` to use GPUs for sparse matrix products.

All the options are described in detail later in this document. Once again, it
is much easier to start off with one of the provided config files. Take care
to adjust the paths (``MKLMODPATH``, ``LIBCUFFTPATH``, ``LIBCUDAPATH``,
``LIBNVSHMEMPATH`` -- if present).
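Purely as an illustration, a hand-rolled compile-and-link line combining the
flags above might look roughly as follows. The compiler wrapper name and the
angle-bracket placeholders are assumptions made for this sketch only -- in
practice, copy the tested flags and paths from one of the provided config
files::

   # Illustrative sketch only -- not a supported build line.
   # The wrapper name and the <...> placeholders are assumptions; prefer the
   # provided config files, which contain the tested flags and paths.
   mpif90 -acc -DGPU_ACC -DGPU_FFT_CUFFT_ACC <further -D options, e.g. -DGPU_DGEMM> \
          <ONETEP sources/objects> \
          -L<path to cuFFT> -L<path to CUDA libraries> \
          -cudalibs -lcufft -lcublas -o onetep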
Running jobs
============

Easy-peasy! ``onetep_launcher`` now supports GPUs and will happily set
everything up for you. Just make sure to follow the documentation of your HPC
system to ensure you ask for GPUs in your submission script. This is typically
realised by passing ``--gres=gpu:n`` to ask for *n* GPUs on each node. Some
HPC systems (*cough*, JURECA, *cough*) instead prefer that you don't do this.
You might also want to ensure you get exclusive access to the nodes. This may
or may not happen automatically, depending on your HPC system.

Then, pass the options ``-G -g n`` to ``onetep_launcher``, where *n* is the
number of GPUs on a node. The option ``-g n`` tells ``onetep_launcher`` that
you would like to use *n* GPUs on each node. The option ``-G`` tells it to
start MPS (Multi-Process Service) for you. This saves you a lot of hassle -- a
*single* instance of MPS needs to be started on each node, and
``onetep_launcher`` knows how to do this. It will also ensure this worked, and
abort if it didn't. If, for some reason, you prefer not to use MPS, do not
pass ``-G``. In typical scenarios using MPS gives a modest (10-20%) boost to
GPU performance, but your mileage may vary. In particular, if you have a 1:1
mapping between MPI ranks and GPUs, you might prefer not to use MPS. To start
and stop MPS, ``onetep_launcher`` uses an auxiliary script called
``gpu_mps_control``, provided in the ``utils`` directory.

Starting from v7.3.68, ``onetep_launcher``'s ``resmon`` also supports GPUs.
That is, if you pass the ``-r k`` option to ``onetep_launcher`` to ask for
resource monitoring every *k* seconds, then, in addition to running ``top``,
``onetep_launcher`` will also run ``gpu_top`` (provided in the ``utils``
directory). Results will go into a directory called ``resmon``. This lets you
easily monitor the utilisation of the GPU(s) and GPU memory use. GPU
out-of-memory (OOM) scenarios are difficult to debug, and this can help.
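As a sketch only, a SLURM-style submission fragment for nodes with 4 GPUs
might look like the following. The scheduler directives, the use of ``srun``
and all paths are assumptions made for this example -- follow your HPC
system's documentation and your usual way of invoking ``onetep_launcher``::

   #!/bin/bash
   #SBATCH --nodes=2
   #SBATCH --gres=gpu:4      # ask for 4 GPUs on each node (site-specific)
   #SBATCH --exclusive       # exclusive node access, if not granted automatically

   # -g 4 : use 4 GPUs on each node;    -G   : have onetep_launcher start MPS;
   # -t 8 : 8 OpenMP threads per rank;  -r 30: resource monitoring every 30 s.
   srun ./utils/onetep_launcher -G -g 4 -t 8 -r 30 <your usual onetep_launcher arguments>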
Control over GPU acceleration
=============================

The following compile-time options are recognised by the GPU port.

+--------------------------+-----------------------------------------------------------+
| Option                   | Effect                                                    |
+==========================+===========================================================+
| ``-DGPU_ACC``            | Enables OpenACC, and thus the GPU port. Required.         |
+--------------------------+-----------------------------------------------------------+
| ``-DGPU_FFT_CUFFT_ACC``  | Uses GPUs (via cuFFT, controlled via OpenACC) for Fourier |
|                          | interpolation and Fourier filtering in fast density and   |
|                          | fast local potential integrals. Does not affect the rest  |
|                          | of ONETEP. By moving adjacent operations to the GPU,      |
|                          | host-to-device copyin and device-to-host copyout can be   |
|                          | avoided. Highly recommended.                              |
+--------------------------+-----------------------------------------------------------+
| ``-DGPU_FFT_CUFFT_CUDA`` | An alternative to ``-DGPU_FFT_CUFFT_ACC`` which uses a    |
|                          | CUDA (rather than an OpenACC) backend to cuFFT. In some   |
|                          | scenarios it can offer a modest advantage over            |
|                          | ``-DGPU_FFT_CUFFT_ACC``, but it has not been tested as    |
|                          | thoroughly. Requires CUDA Fortran. Prefer                 |
|                          | ``-DGPU_FFT_CUFFT_ACC``.                                  |
+--------------------------+-----------------------------------------------------------+
| ``-DGPU_DGEMM``          | Moves ``DGEMM()`` operations in ``sparse_product()`` to   |
|                          | cuBLAS. Reduces the default value of ``dense_threshold``  |
|                          | from 0.10 to 0.05. This is a naive approach to porting    |
|                          | sparse matrix multiplications to GPUs. Anytime dense      |
|                          | blocks that are larger than 256x256 are multiplied, the   |
|                          | multiplication is offloaded to the GPU(s). This is only   |
|                          | modestly faster than CPU BLAS because of copyin and       |
|                          | copyout. The associated reduction in ``dense_threshold``  |
|                          | helps move more matmuls to the GPU. If your nodes have a  |
|                          | lot of CPU power compared to GPU power, this might not    |
|                          | be advantageous at all. However, if you have a lot of     |
|                          | GPU power, this can speed up sparse algebra               |
|                          | substantially. For instance, with H100 cards it's likely  |
|                          | to help. With A100 cards it may help if you don't have    |
|                          | very many cores on a node (e.g. 64 or 48). On a desktop   |
|                          | machine with ~10 cores, it's likely to help even with a   |
|                          | less powerful GPU. You might want to measure with and     |
|                          | without this setting to see if it helps with performance  |
|                          | on your machine. If you find good speed-ups, you may      |
|                          | consider reducing ``dense_threshold`` further, even to    |
|                          | 0.0.                                                      |
+--------------------------+-----------------------------------------------------------+
| ``-DGPU_A100``           | Forces the GPU port of Hartree-Fock exchange to use an    |
|                          | alternative parallelisation scheme, suitable for A100     |
|                          | cards. Use this if you have A100 GPUs or if you           |
|                          | experience deadlocks when using HFx on GPUs.              |
|                          | Otherwise, you probably do not need this, although it     |
|                          | may be beneficial to check if it improves performance.    |
|                          | Ignored outside of Hartree-Fock exchange.                 |
+--------------------------+-----------------------------------------------------------+

There is also one runtime option (specified in the input file) that controls
the GPU port:

- ``threads_gpu`` -- can be used to adjust the number of OpenMP threads in
  loops involving FFTs on the GPU(s). This defaults to ``threads_max`` (which
  is the number of OpenMP threads for most of ONETEP -- set with
  ``onetep_launcher``'s ``-t`` option or via ``OMP_NUM_THREADS``). However, if
  this value is large (e.g. 16 or more), it can put a lot of strain on GPU
  memory. If you find that you run out of GPU memory, reduce this value (e.g.
  to 4 or, as a last resort, to 1). This will vastly reduce the GPU memory
  requirement, while the reduction in performance should not be dramatic --
  FFTs are expensive, and even with fewer threads it is often possible to
  saturate the GPU.

Hartree-Fock exchange
=====================

This has been ported only partially. Work is in progress to complete the port.
You will notice the largest speed-ups for relatively small systems (under 200
atoms). For the best speed-up, set:

- ``hfx_memory_limit -1`` to turn off automatic memory management,
- ``cache_limit_for_swops 0`` to not waste memory on caching spherical wave
  potentials ("SWOPs"), as they will be recalculated efficiently on the
  GPU(s),
- ``cache_limit_for_expansions n`` to instead use *n* MB of RAM per MPI rank
  for caching expansions (of NGWF pairs in spherical waves), where you should
  choose *n* to be as large as possible without exceeding your RAM allowance.

An illustrative input fragment combining these settings with ``threads_gpu``
is given at the end of this document. All the usual guidelines in
:ref:`hfx_advanced` still apply.

State of the art
================

This is a preliminary implementation. It has been tuned for single-node
performance, and will likely not scale well beyond several nodes. Work is in
progress to address that. Much of the time is still spent on the CPUs. We are
working on that too.
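For convenience, here is an illustrative input-file fragment collecting the
runtime settings discussed above. The numerical values are placeholders chosen
for this sketch -- tune ``threads_gpu`` to your GPU memory and
``cache_limit_for_expansions`` to your RAM allowance::

   ! Illustrative values only -- tune them to your hardware.
   threads_gpu                4     ! fewer FFT threads if GPU memory is tight
   hfx_memory_limit          -1     ! turn off automatic HFx memory management
   cache_limit_for_swops      0     ! do not cache SWOPs; recalculate them on the GPU(s)
   cache_limit_for_expansions 8000  ! n MB of RAM per MPI rank for caching expansions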