# Performance considerations

Author: Jacek Dziedzic, University of Southampton

How quickly ONETEP runs your calculations depends on a number of factors. Choosing the number of nodes, processes and threads has already been described in Running ONETEP in parallel environments.

Here we mention techniques that can be used in addition to an efficient parallel decomposition to improve performance.

## Fast sparse-to-dense and dense-to-sparse conversions

In many scenarios, ensemble DFT calculations being a notable example, ONETEP needs to convert between two matrix representations – a distributed sparse one (“SPAM3”), and a distributed dense one (“BLACS”). These conversions happen hundreds of times in a run, and involve substantial data communication. There are two ways in which ONETEP can do these – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 6.1.21. Starting from ONETEP 6.1.22, a fast approach
is also available, *and is the default*. This means that no action is required on your part to use the fast approach.
The fast approach is up to several times faster (for this part of the calculation).

You can check which approach you are using by examining the timings printed out at the end of your calculation
(provided you are not using `timings_level 0`). If your timings include `sparse_spam3toblacs_real_fast` and
`sparse_blacstospam3_real_fast` – you are using the fast approach. If instead you have
`sparse_spam3toblacs_real_slow` and `sparse_blacstospam3_real_slow` – you are using the slow approach.
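The check described above can be automated. The following is a minimal Python sketch that only searches the output text for the routine names quoted above; it assumes nothing about the layout of the timings table. The function name `conversion_approach` and the sample text are hypothetical.

```python
# Routine names quoted in the text above.
FAST_NAMES = ("sparse_spam3toblacs_real_fast", "sparse_blacstospam3_real_fast")
SLOW_NAMES = ("sparse_spam3toblacs_real_slow", "sparse_blacstospam3_real_slow")


def conversion_approach(output_text):
    """Return 'fast', 'slow', 'both' or 'unknown' for the text of an output file.

    Hypothetical helper: it only looks for the routine names as substrings.
    """
    has_fast = any(name in output_text for name in FAST_NAMES)
    has_slow = any(name in output_text for name in SLOW_NAMES)
    if has_fast and has_slow:
        return "both"
    if has_fast:
        return "fast"
    if has_slow:
        return "slow"
    # No matching timers found, e.g. timings_level 0 or no conversions performed.
    return "unknown"


# Made-up snippet standing in for the timings section of an output file.
sample = "sparse_spam3toblacs_real_fast ...\nsparse_blacstospam3_real_fast ..."
print(conversion_approach(sample))
```

In practice you would read the text of your own output file and pass it to the function instead of `sample`.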

To switch between the approaches use:

- `fast_dense_to_sparse T` and `fast_sparse_to_dense T` – for the fast approach,
- `fast_dense_to_sparse F` and `fast_sparse_to_dense F` – for the slow approach.

The fast approach is currently incompatible with the image comms subsystem needed by NEB. This means you cannot
use the fast approach with NEB. Please add `fast_dense_to_sparse F` and `fast_sparse_to_dense F` to your input
file when using NEB.

## Fast density calculation (for users)

This is a user-level explanation – for developer-oriented material, see Fast density calculation (for developers).

Calculating the electronic charge density is one of the more time-consuming operations in ONETEP. In a typical calculation it has to be performed hundreds of times. There are two ways in which ONETEP can calculate the density – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 7.1.7.
Starting with ONETEP 7.1.8, a fast approach is also available, *but it is not the default*.
This means that action is required on your part to use the fast approach.
The fast approach is up to several times faster (for this part of the calculation),
but that depends heavily on the system. It does require more memory.

**The fast approach works best for “serious” systems. It is not meant to address
scenarios with KE cutoffs below 700-800 eV or NGWFs smaller than 8.0 a0. It will
also not perform well when your FFT-box is small (say, below 80 x 80 x 80 points),
as happens e.g. for very small periodic supercells. In this case it can actually be slower.**

To switch between the approaches use:

- `fast_density T` – for the fast approach,
- `fast_density F` – for the slow approach.

Be aware that the fast approach *is an approximation*. The approximation is well-controllable,
which means you can get as close to the accuracy of the slow (exact) approach as you
like, albeit sacrificing performance as you do so.
Conversely, you can make the fast approach as fast as you like, but you will be
sacrificing accuracy as you do so, up to the point of making your results worthless.
This means care must be taken when changing the parameters of this approach –
non-expert users are advised to use the defaults, or even the “safe settings” described below.

Starting from ONETEP 7.1.50 there are actually *three* different fast density
methods implemented. They offer practically the same accuracy for the same
settings (see below), but employ different tradeoffs between memory use and
performance. As a user, about the only thing you need to know is:

- `fast_density_method 1` is usually the fastest, but requires the most memory,
- `fast_density_method 2` is a failed experiment and typically performs poorly; you should not be using it,
- `fast_density_method 3` is usually somewhat slower than method 1, but faster than the slow (default) calculation. It has two important advantages – **it uses much less memory than** `fast_density_method 1`, **and it has been GPU ported**. If you are running on GPUs, you should definitely be using this option.

The default (once you have specified `fast_density T`) is `fast_density_method 1`.

You can check which approach you are using by examining the timings printed out
at the end of your calculation (provided you are not using `timings_level 0`).
If your timings include `density_fast_new_ngwfs` – you are using one of the fast approaches.
If instead you have `density_on_dbl_grid` and `density_batch_interp_deposit` – you are
using the slow approach.

The main idea behind the fast density approach is *trimming* interpolated NGWFs, that is, ignoring the points
where their (absolute) values are below a prescribed threshold. The value of this threshold, set by
`trimmed_boxes_threshold`, is the main parameter controlling the balance between accuracy and efficiency.
It is also independent of NGWF radii. The default is `2E-6`.
The parameter already includes grid weights, so you *do not* need to adjust it when changing `psinc_spacing`
or `cutoff_energy`. To make your calculation more accurate, decrease the threshold – probably not below `1E-7`.
To make your calculation faster, increase the threshold – probably not above `1E-5`.
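To illustrate the idea of trimming, here is a toy Python sketch. It is not ONETEP's implementation: the sample values, the one-dimensional grid and the function name `trim` are made up for illustration. It keeps only those points whose absolute value reaches the threshold, and shows that a lower threshold retains more points (more accuracy, more work).

```python
def trim(values, threshold):
    """Keep (index, value) pairs whose absolute value is at least threshold."""
    return [(i, v) for i, v in enumerate(values) if abs(v) >= threshold]


# Made-up samples of a rapidly decaying function on a 1D grid.
samples = [0.5 * 0.1 ** i for i in range(10)]

# Thresholds taken from the range discussed in the text above.
for threshold in (1e-5, 2e-6, 1e-7):
    kept = trim(samples, threshold)
    print(f"threshold {threshold:.0e}: kept {len(kept)} of {len(samples)} points")
```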

If you use `trimmed_boxes_output_detail VERBOSE` (or higher), ONETEP will print out an estimate of the accuracy
of the approximation every time NGWFs change. It will look something like Fig. 11,
see the *accuracy of approximation* line. This tells you to how many digits the approximated NGWF charge is equal
to the exact (double FFT-box) NGWF charge, in the root-mean-square sense over all NGWFs in the system. In this
example our approximated charge differs from 1.0 (a correctly normalized NGWF) by no more than 1E-7
(and is slightly closer, because we got 7.16, not 7.0).

As your calculation progresses, this value will fluctuate, and is likely to go down slightly, as the NGWFs become more diffuse. As a rule of thumb, if it gets below 5.0-6.0, you will have difficulty converging NGWFs to the default threshold. If it is above 9.0, you are probably using too much accuracy, losing efficiency as you do that.
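As a worked example of reading this number, the conversion from digits to a deviation can be sketched in Python. This assumes, as the example in the text suggests, that the reported value is the negative base-10 logarithm of the RMS deviation of the NGWF charge from 1.0. The function name is hypothetical.

```python
def deviation_from_digits(digits):
    """Convert 'accuracy of approximation' digits into an RMS deviation.

    Assumes digits = -log10(deviation), as implied by the example in the text.
    """
    return 10.0 ** (-digits)


# 7.16 is the value from the example; 5.0 and 9.0 are the rule-of-thumb bounds.
for digits in (7.0, 7.16, 5.0, 9.0):
    deviation = deviation_from_digits(digits)
    print(f"{digits:5.2f} digits -> deviation of about {deviation:.1e}")
```

Under this assumption, the 7.16 digits in the example correspond to a deviation somewhat smaller than 1E-7, consistent with the remark that the charge is "slightly closer" to 1.0.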

Another notable quantity in Fig. 11 is the *estimated high-memory watermark per MPI rank*
(shown in yellow). This is a reminder that the fast density approach uses significantly more memory than the slow approach.
The value in the printout is the expected *maximum* memory that fast density uses *per MPI rank*. If your printout is
truncated before you reached this line, you most likely already ran out of memory. At this stage, we use an all-or-nothing
approach – there is no way to give the algorithm a memory allowance and tell it that it should not consume more. Work on
this is in progress. The best way to reduce memory load is to use fewer processes per node and more threads. If this is
not sufficient, you can reduce the memory load by using more nodes, but this is not a linear dependence – i.e. you will
*not* reduce the load by a factor of two if you add twice as many nodes. Note that what is printed out is the
amount of memory consumed by the fast density approach, not by all of ONETEP.
Finally, the *estimated high-memory watermark* is not yet printed for `fast_density_method 3`.

### When is fast density used?

Fast density is only used for energy evaluations done from `hamiltonian_mod` – via `hamiltonian_lhxc_calculate()`
and `hamiltonian_energy_components()`. These are the costly density calculations, because they are done hundreds
of times in the course of a calculation. All other density calculations (done in forces, properties, eigenstates,
linear response, lr_tddft, population, dma, dmft, EDA, implicit solvent restarts) are always done using the exact
(slow) method. The rationale is that these are done much less often and possibly require more accuracy.

If you want to know when the fast and slow routines are called, specify `trimmed_boxes_output_detail PROLIX` or higher.

### More accuracy

The default settings should give you sufficient accuracy to converge NGWFs to the default threshold and to get energies and forces that are negligibly different from those obtained with the slow approach. However, for more difficult systems, particularly if using low kinetic energy cutoffs (say, below 700 eV, as would probably be used with PAW), you might need to adjust the parameters to get the desired accuracy.

In addition to adjusting `trimmed_boxes_threshold` down (to perhaps 1E-6 or 5E-7), you may want to use
`fast_density_off_for_last T` (the default is `F`). This will tell ONETEP to use the slow (but exact) approach for
the final energy evaluation. You will know this happened by examining the output file and looking for:

```
! Looks like the last energy evaluation.
! The fast density calculation will now be disabled in the interest of accuracy.
```

Note that this will not be printed if `trimmed_boxes_output_detail` is `BRIEF` or if fast density would already
have been switched off by `fast_density_elec_energy_tol` (see below). This setting resets any time you start a new
NGWF convergence loop – that means that in auto solvation, geometry optimisation, MD, etc. each optimisation will
start with fast density turned on.

Also note that this switching is done in the NGWF convergence loop. If you are working with fixed NGWFs
(`maxit_ngwf_cg 0` or negative), this switching will not take place.

Furthermore, particularly if your calculation struggles to converge to the default
NGWF threshold, you can set `fast_density_elec_energy_tol`. This is the energy change per atom between NGWF steps
below which ONETEP will switch to the slow (but exact) approach. It’s the same quantity that is used as the energy
convergence criterion in `elec_energy_tol`. The default is `1E-50`, effectively turning this off. Setting it
to `1E-7` will typically have ONETEP switch to the slow approach for the last few NGWF iterations. The higher
you set this, the sooner ONETEP will switch to the slow approach. This, of course, eats into your efficiency gain.
You will know if and when this happened by examining the output file and looking for:

```
! Energy change per atom: 0.30287E-07 Eh < 0.10000E-06.
! The fast density calculation will now be disabled in the interest of accuracy.
```

Note that this will not be printed if `trimmed_boxes_output_detail` is `BRIEF`. This setting resets any time
you start a new NGWF convergence loop – that means that in auto solvation, geometry optimisation, MD, etc. each
optimisation will start with fast density turned on.

Note that you need at least two NGWF iterations to have a meaningful energy change to examine, so this setting has no effect if you take fewer than two NGWF iterations.
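The switching criterion can be sketched as follows. This is a simplified Python illustration of the rule described above, not ONETEP code: the function name and the energies are made up, and it assumes the energy change is the absolute difference between the per-atom energies of consecutive NGWF iterations.

```python
def first_slow_iteration(energies_per_atom, tol):
    """Return the 1-based NGWF iteration at which the change first drops below tol.

    energies_per_atom: energy per atom (Eh) after each NGWF iteration.
    Returns None if the change never drops below tol, or if there are fewer
    than two iterations (no energy change to examine).
    """
    for it in range(1, len(energies_per_atom)):
        change = abs(energies_per_atom[it] - energies_per_atom[it - 1])
        if change < tol:
            return it + 1
    return None


# Made-up per-atom energies for five NGWF iterations.
energies = [-5.0, -5.001, -5.00101, -5.00101003, -5.00101004]
print(first_slow_iteration(energies, 1e-7))   # tolerance suggested in the text
print(first_slow_iteration(energies, 1e-50))  # default value: never triggers
```

With a single energy in the list the function returns `None`, mirroring the remark that at least two NGWF iterations are needed.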

### Remaining options

The default output detail of fast density is the same as specified for `output_detail`. You can set it separately
by specifying `trimmed_boxes_output_detail`. The available options are the same as for all ONETEP output details:
`BRIEF`, `NORMAL`, `VERBOSE`, `PROLIX` and `MAXIMUM`.

### Example settings

- For a quick-and-dirty calculation use:
  - `fast_density T`
  - `trimmed_boxes_threshold 2E-5`

- For a typical calculation just use:
  - `fast_density T` (which will use the default of `trimmed_boxes_threshold 2E-6`)

- For an accurate, but slower calculation use:
  - `fast_density T`
  - `trimmed_boxes_threshold 1E-6`
  - `fast_density_off_for_last T`
  - `fast_density_elec_energy_tol 1E-7`

- For very safe settings that should provide a modest gain in efficiency, try:
  - `fast_density T`
  - `trimmed_boxes_threshold 5E-7`
  - `fast_density_off_for_last T`
  - `fast_density_elec_energy_tol 3E-7`

- If you keep running out of memory, try adding `fast_density_method 3` to one of the above sets of settings.

- If you are running ONETEP on GPUs, most definitely add `fast_density_method 3` to one of the above sets of settings.
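If you generate input files from scripts, the sets of settings above can be kept in one place. The following Python sketch simply transcribes them; the preset names (`quick`, `typical`, `accurate`, `safe`) and the helper `render` are hypothetical and not part of ONETEP.

```python
# Presets transcribed from the example settings listed above.
PRESETS = {
    "quick": {"fast_density": "T", "trimmed_boxes_threshold": "2E-5"},
    "typical": {"fast_density": "T"},
    "accurate": {
        "fast_density": "T",
        "trimmed_boxes_threshold": "1E-6",
        "fast_density_off_for_last": "T",
        "fast_density_elec_energy_tol": "1E-7",
    },
    "safe": {
        "fast_density": "T",
        "trimmed_boxes_threshold": "5E-7",
        "fast_density_off_for_last": "T",
        "fast_density_elec_energy_tol": "3E-7",
    },
}


def render(preset, low_memory_or_gpu=False):
    """Return the input-file lines for a preset, one 'keyword value' per line."""
    settings = dict(PRESETS[preset])
    if low_memory_or_gpu:
        # Recommended above when running out of memory or running on GPUs.
        settings["fast_density_method"] = "3"
    return "\n".join(f"{key} {value}" for key, value in settings.items())


print(render("safe", low_memory_or_gpu=True))
```

The printed lines can be pasted into an input file alongside your other keywords.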

### Compatibility

- Fast density is known to work (to the best of our knowledge) with the following additional functionalities:
  - extended NGWFs,
  - PBCs and OBCs,
  - implicit solvation,
  - hybrid functionals and Hartree-Fock exchange,
  - `fine_grid_scale` larger than `2.0`,
  - PAW,
  - DFT+U,
  - conduction,
  - MD,
  - geometry optimisation,
  - TS search,
  - NEB,
  - EDFT and LNV.

- Fast density is known *not* to work (this we know with certainty) with the following additional functionalities:
  - complex NGWFs (and, thus, k-points),
  - spin-polarised NGWFs (but spin-polarised density kernel is compatible),
  - TD-DFT (mixed bases are not supported at this point),
  - EMFT (regions).

ONETEP will stop with an error if any of these is used with `fast_density T`.

## Fast local potential integrals (for users)

This is a user-level explanation – for developer-oriented material, see Fast local potential integrals (for developers).

The calculation of local potential integrals is another time-consuming part of ONETEP. In a typical calculation it has to be performed hundreds of times. There are two ways in which ONETEP can calculate the local potential integrals – conveniently termed “slow” and “fast”.

The slow approach is the only approach available until ONETEP 7.1.49.
Starting with ONETEP 7.1.50, a fast approach is also available, *but it is not the default*.
This means that action is required on your part to use the fast approach.
The fast approach is up to several times faster (for this part of the calculation),
but that depends heavily on the system. It does require more memory.

The fast approach for local potential integrals uses techniques similar to those described in Fast density calculation (for users),
that is, *trimming* of data in double-grid FFT-boxes, which is a well-controllable approximation,
but an approximation nevertheless. It would be prudent to read the section on Fast density calculation (for users),
and the part about controlling accuracy in particular. The same mechanism is
used here (`trimmed_boxes_threshold`).

**The fast approach works best for “serious” systems. It is not meant to address
scenarios with KE cutoffs below 700-800 eV or NGWFs smaller than 8.0 a0. It will
also not perform well when your FFT-box is small (say, below 80 x 80 x 80 points),
as happens e.g. for very small periodic supercells. In this case it can actually be slower.**

To switch between the approaches use:

- `fast_locpot_int T` – for the fast approach,
- `fast_locpot_int F` – for the slow approach.

In contrast to fast density, there is only one fast locpot int approach implemented, so there is no need to choose a method, just turning it on is sufficient. The fast locpot int approach works best when `fast_density T` is in use (regardless of `fast_density_method`), as they share some of the workload and memory requirement. You can expect good synergy when using both approaches at the same time.

There are *no* additional settings for fast locpot int at this point, simply turning
it on is sufficient. For pointers about settings, see the suggested settings
in Fast density calculation (for users), and just add `fast_locpot_int T` to any of them.

A preliminary GPU port of fast locpot int is in place (starting from ONETEP 7.1.50). It is activated automatically if you run a GPU-capable binary.