Performance considerations

Author:

Jacek Dziedzic, University of Southampton

How quickly ONETEP runs your calculations depends on a number of factors. Choosing the number of nodes, processes and threads has already been described in Running ONETEP in parallel environments.

Here we mention techniques that can be used in addition to an efficient parallel decomposition to improve performance.

Fast sparse-to-dense and dense-to-sparse conversions

In many scenarios, ensemble DFT calculations being a notable example, ONETEP needs to convert between two matrix representations – a distributed sparse one (“SPAM3”), and a distributed dense one (“BLACS”). These conversions happen multiple (hundreds) times in a run, and involve substantial data communication. There are two ways in which ONETEP can do these – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 6.1.21. Starting from ONETEP 6.1.22, a fast approach is also available, and is the default. This means that no action is required on your part to use the fast approach. The fast approach is up to several times faster (for this part of the calculation).

You can check which approach you are using by examining the timings printed out at the end of your calculation (provided you are not using timings_level 0). If your timings include sparse_spam3toblacs_real_fast and sparse_blacstospam3_real_fast – you are using the fast approach. If instead you have sparse_spam3toblacs_real_slow and sparse_blacstospam3_real_slow – you are using the slow approach.

To switch between the approaches use:
  • fast_dense_to_sparse T and fast_sparse_to_dense T – for the fast approach,

  • fast_dense_to_sparse F and fast_sparse_to_dense F – for the slow approach.

The fast approach is currently incompatible with the image comms subsystem needed by NEB. This means you cannot use the fast approach with NEB. Please add fast_dense_to_sparse F and fast_sparse_to_dense F to your input file when using NEB.

Fast density calculation (for users)

This is a user-level explanation – for developer-oriented material, see Fast density calculation (for developers). All of the text below applies to ONETEP 7.3.40 and later. Certain subsets of the functionality described below have been available earlier (starting from 7.1.8), but some things, including defaults, changed in the meantime. Make sure you are using ONETEP 7.3.40 or later.

Calculating the electronic charge density is one of the more time-consuming operations in ONETEP. In a typical calculation it has to be performed hundreds of times. There are two ways in which ONETEP can calculate the density – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 7.1.7. Starting with ONETEP 7.1.8, a fast approach is also available, but it is not the default. This means that action is required on your part to use the fast approach. The fast approach is up to several times faster, (for this part of the calculation), and up to 15 times faster when running on a GPU, but that depends heavily on the system. It does require more memory.

The fast approach works best for “serious” systems, it’s not meant to address scenarios with KE cutoffs below 700-800 eV or NGWFs smaller than 8.0 a0. It will also not perform well when your FFT-box is small (say, below 80 x 80 x 80 points), as happens e.g. for very small periodic supercells. In this case it can actually be slower.

To switch between the approaches use:
  • fast_density T – for the fast approach,

  • fast_density F – for the slow approach.

Be aware that the fast approach is an approximation. The approximation is well-controllable, which means you can get as close with the accuracy to the slow (exact) approach as you like, albeit sacrificing performance as you do so. Conversely, you can make the fast approach as fast as you like, but you will be sacrificing accuracy, up to a point of making your results worthless. This means care must be taken when changing the parameters of this approach – non-expert users are advised to use the defaults, or even “safe settings” described below.

Starting from ONETEP 7.1.50 there are actually three different fast density methods implemented. They offer the practically same accuracy for the same settings (see below), but employ different tradeoffs between memory use and performance. As a user, about the only thing you need to know is:

  • fast_density_method 1 is usually the fastest, but requires the most memory,

  • fast_density_method 2 is a failed experiment, and typically performs poorly, you should not be using it.

  • fast_density_method 3 is usually somewhat slower than method 1, but faster than the slow (default) calculation. It has two important advantages – it uses much less memory than fast_density_method 1, and it has been GPU ported. If you are running on GPUs, you should be definitely using this option.

The default (once you specified fast_density T) is fast_density_method 3 (starting from ONETEP 7.3.40).

You can check which approach you are using by examining the timings printed out at the end of your calculation (provided you are not using timings_level 0). If your timings include density_fast_new_ngwfs – you are using one of the fast approaches. If instead you have density_on_dbl_grid and density_batch_interp_deposit – you are using the slow approach.

The main idea behind the fast density approach is trimming interpolated NGWFs, that is, ignoring the points whose (absolute) values are below a prescribed threshold. The value of this threshold, set by trimmed_boxes_threshold is the main parameter controlling the balance between accuracy and efficiency. This parameter is independent of NGWF radii. It also already includes grid weights, so you do not need to adjust it when changing psinc_spacing or cutoff_energy.

Since this is possibly the only parameter worth adjusting, here’s a more detailed description.

Positive values of trimmed_boxes_threshold define the threshold below which NGWF values on the double grid are ignored (“trimmed”). A reasonable value is 1E-6, which should not interfere with NGWF convergence to the default RMS gradient thresholds (which is ngwf_threshold_orig 2E-6). If you need to converge tighter (this can happen e.g. when you set elec_energy_tol), consider reducing this value to get more accuracy – e.g. to 5E-7 or 2E-7 (this is already very tight). You probably don’t want to decrease the threshold below 1E-7. To make your calculation faster, increase the threshold – probably not above 1E-5.

A negative value of trimmed_boxes_threshold turns on adaptive thresholding – the threshold will start at 1E-5 and will progressively decrease during NGWF convergence. A value of -1.0 will set the threshold to a quarter of the numerical value of the NGWF RMS gradient, but never larger than 1E-5. In practice, this means you’ll start at 1E-5 for the first few iterations, and then as your NGWF RMS gradient goes down, so will the threshold. The tighter you want to converge your NGWFs, the tighter the threshold will become. This frees you from worrying about this parameter. Starting from ONETEP 7.3.40, -1.0 is the default if you are optimising NGWFs. If maxit_ngwf_cg is non-positive (implying fixed NGWFs), 1E-6 is used instead – there is nothing to adapt between NGWF steps, because there are no NGWF steps then.

In the unlikely case where you’d want to keep adaptive thresholding but control how it is coupled to the NGWF RMS gradient, you can choose a different negative value. -0.5 will make the threshold equal to an eighth of the numerical value of the NGWF RMS gradient, -2.0 to a half, and so on. The cap always remains at 1E-5.

If you use trimmed_boxes_output_detail VERBOSE (or higher), ONETEP will print out a summary of the approximation every time NGWFs change. It will look similar to Fig. 11. This includes the current value of the trimming threshold (first line), and the quality of the approximation (accuracy of approximation). The latter tells you to how many digits the approximated NGWF charge is equal to the exact (double FFT-box) NGWF charge, in the root-mean-square sense over all NGWFs in the system. In this example our approximated charge is no further from 1.0 (a correctly normalized NGWF) than ~1E-8.

As your calculation progresses, this value will fluctuate, and is likely go down slightly, as the NGWFs become more diffuse. As a rule of thumb, if it gets below 5.0-6.0, you will have difficulty converging NGWFs to the default threshold. If it is above 9.0, you are probably using too much accuracy, losing efficiency as you do that.

Fast density -- information on accuracy and memory use.

Fig. 11 The summary printed by fast density every time the NGWFs change. Of main interest are: accuracy of approximation (shown in red) and estimated high-memory watermark per MPI rank (shown in yellow).

Another notable quantity in Fig. 11 shown in the last line. Depending on method, it can be called Final NGWF data memory load or estimated high-memory watermark per MPI rank. It’s shown here in yellow. This is a reminder that the fast density approach uses significantly more memory than the slow approach. The value in the printout is the expected maximum memory that fast density uses per MPI rank. If your printout is truncated before you reached this line, you most likely already ran out of memory. At this stage, we use an all-or-nothing approach – there is no way to give the algorithm a memory allowance and tell it that it should not consume more. Work on this is in progress. The best way to reduce memory load is to use fewer processes per node and more threads. If this is not sufficient, you can reduce the memory load by using more nodes, but this is not a linear dependence – i.e. you will not reduce the load by a factor of two if you add twice as many nodes. Note that what is printed out is the amount of memory consumed by the fast density approach, not by all of ONETEP.

When is fast density used?

Fast density is only used for energy evaluations done from hamiltonian_mod – via hamiltonian_lhxc_calculate() and hamiltonian_energy_components(). These are the costly density calculations, because they are done hundreds of times in the course of a calculation. All other density calculations (done in forces, properties, eigenstates, linear response, lr_tddft, population, dma, dmft, EDA, implicit solvent restarts) are always done using the exact (slow) method. The rationale is that these are done much less often and possibly require more accuracy.

If you want to know precisely when the fast and slow routines are called, specify trimmed_boxes_output_detail PROLIX or higher.

Accuracy

The default settings should give you sufficient accuracy to converge NGWFs to any reasonable threshold and to get energies and forces that are negligibly different from those obtained with the slow approach. However, for more difficult systems, particularly if using low kinetic energy cutoffs (say, below 700 eV – like would probably be used with PAW), you might need to tighten the parameters to get desired accuracy.

In addition to adjusting trimmed_boxes_threshold as explained earlier, you may want to use fast_density_off_for_last T (the default is F). This will tell ONETEP to use the slow (but exact) approach for the final energy evaluation. You will know this happened by examining the output file and looking for:

! Looks like the last energy evaluation.
! The fast density calculation will now be disabled in the interest of accuracy.

Note that this will not be printed if trimmed_boxes_output_detail is BRIEF or if fast density would have already been switched off by fast_density_elec_energy_tol (see below). This setting resets any time you start a new NGWF convergence loop – that means that in auto solvation, geometry optimisation, MD, etc. each optimisation will start with fast density turned on.

Also note that this switching is done in the NGWF convergence loop. If you are working with fixed NGWFs (maxit_ngwf_cg 0 (or negative)), this switching will not take place.

Furthermore, particularly if your calculation struggles to converge to the default NGWF threshold, you can set fast_density_elec_energy_tol. This is the energy change per atom between NGWF steps below which ONETEP will switch to the slow (but exact) approach. It’s the same quantity that is used as the energy convergence criterion in elec_energy_tol. The default is 1E-50, effectively turning this off. Setting it to 1E-7 will typically have ONETEP switch to the slow approach for the last few NGWF iterations. The higher you set this, the sooner ONETEP will switch to the slow approach. This, of course, eats into your efficiency gain. You will know if and when this happened by examining the output file and looking for:

! Energy change per atom: 0.30287E-07 Eh < 0.10000E-06.
! The fast density calculation will now be disabled in the interest of accuracy.

Note that this will not be printed if trimmed_boxes_output_detail is BRIEF. This setting resets any time you start a new NGWF convergence loop – that means that in auto solvation, geometry optimisation, MD, etc. each optimisation will start with fast density turned on.

Note that you need at least two NGWF iterations to have a meaningful energy change to examine, so this setting has no effect if you take fewer than two NGWF iterations.

Remaining options

The default output detail of fast density is the same as specified for output_detail. You can set it separately by specifying trimmed_boxes_output_detail. The available options are the same as for all ONETEP output details: BRIEF, NORMAL, VERBOSE, PROLIX and MAXIMUM.

Example settings

For a quick-and-dirty calculation use:
  • fast_density T

  • trimmed_boxes_threshold 1E-5.

For a typical calculation just use:
  • fast_density T (which will use the default of trimmed_boxes_threshold -1.0 for optimised NGWFs, or 1E-6 for fixed NGWFs).

For an accurate, but slower calculation use:
  • fast_density T

  • trimmed_boxes_threshold 7E-7

  • fast_density_off_for_last T

  • fast_density_elec_energy_tol 1E-7.

For very safe settings that should provide a modest gain in efficiency, try:
  • fast_density T

  • trimmed_boxes_threshold 4E-7

  • fast_density_off_for_last T

  • fast_density_elec_energy_tol 3E-7.

If you keep running out of memory, ensure you are not using``fast_density_method 1``. Try fast_density_method 3 instead.

If you are running ONETEP on GPUs, most definitely use fast_density_method 3.

Compatilibity

Fast density is known to work (to the best of our knowledge) with the following additional functionalities:
  • extended NGWFs,

  • PBCs and OBCs,

  • implicit solvation,

  • hybrid functionals and Hartree-Fock exchange,

  • fine_grid_scale larger than 2.0,

  • PAW,

  • DFT+U,

  • conduction,

  • MD,

  • geometry optimisation,

  • TS search,

  • NEB,

  • EDFT and LNV.

Fast density is known not to work (this we know with certainty) with the following additional functionalities:
  • complex NGWFs (and, thus, k-points),

  • spin-polarised NGWFs (but spin-polarised density kernel is compatible),

  • TD-DFT (mixed bases are not supported at this point).

  • EMFT (regions).

ONETEP will stop with an error if either of these is used with fast_density T.

Fast local potential integrals (for users)

This is a user-level explanation – for developer-oriented material, see Fast local potential integrals (for developers).

The calculation of local potential integrals is another time-consuming part of ONETEP. In a typical calculation it has to be performed hundreds of times. There are two ways in which ONETEP can calculate the local potential integrals – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 7.1.49. Starting with ONETEP 7.1.50, a fast approach is also available, but it is not the default. This means that action is required on your part to use the fast approach. The fast approach is up to several times faster (for this part of the calculation), but that depends heavily on the system. It does require more memory.

The fast approach for local potential integrals uses similar techniques as Fast density calculation (for users), that is trimming of data in double-grid FFT-boxes, which is a well-controllable approximation, but an approximation nevertheless. It would be prudent to read the section on Fast density calculation (for users), and the part about controlling accuracy in particular. The same mechanism is used here (trimmed_boxes_threshold).

The fast approach works best for “serious” systems, it’s not meant to address scenarios with KE cutoffs below 700-800 eV or NGWFs smaller than 8.0 a0. It will also not perform well when your FFT-box is small (say, below 80 x 80 x 80 points), as happens e.g. for very small periodic supercells. In this case it can actually be slower.

To switch between the approaches use:
  • fast_locpot_int T – for the fast approach,

  • fast_locpot_int F – for the slow approach.

In contrast to fast density, there is only one fast locpot int approach implemented, so there is no need to choose a method, just turning it on is sufficient. The fast locpot int approach works best when fast_density T is in use (regardless of fast_density_method), as they share some of the workload and memory requirement. You can expect good synergy when using both approaches at the same time.

There are no additional settings for fast local potential integrals at this point (apart from trimmed_boxes_threshold), simply turning it on is sufficient. For pointers about about settings, see the suggested settings in Fast density calculation (for users), just add fast_locpot_int T to any of them.

A GPU port of fast local potential integrals is in place (starting from ONETEP 7.1.50). It is activated automatically if you run a GPU-capable binary.

Fast NGWFs (for users)

This is a user-level explanation – for developer-oriented material, see Fast NGWFs (for developers).

This is an experimental feature at this point (January 2025). The PPD representation of NGWFs in ONETEP can be replaced by a faster representation known as the rod representation. This can be done with fast_ngwfs T.

Currently this is only used when fast_locpot_int T is in effect, and you will see zero effect otherwise. Even with fast_locpot_int T, you are unlikely to see much benefit at this point, unless you are running on a GPU. On a GPU you can expect modest improvements in performance.

The default is fast_ngwfs F.